{"title": "PointDAN: A Multi-Scale 3D Domain Adaption Network for Point Cloud Representation", "book": "Advances in Neural Information Processing Systems", "page_first": 7192, "page_last": 7203, "abstract": "Domain Adaptation (DA) approaches achieved significant improvements in a wide range of machine learning and computer vision tasks (i.e., classification, detection, and segmentation). However, as far as we are aware, there are few methods yet to achieve domain adaptation directly on 3D point cloud data. The unique challenge of point cloud data lies in its abundant spatial geometric information, and the semantics of the whole object is contributed by including regional geometric structures. Specifically, most general-purpose DA methods that struggle for global feature alignment and ignore local geometric information are not suitable for 3D domain alignment. In this paper, we propose a novel 3D Domain Adaptation Network for point cloud data (PointDAN). PointDAN jointly aligns the global and local features in multi-level. For local alignment, we propose Self-Adaptive (SA) node module with an adjusted receptive field to model the discriminative local structures for aligning domains. To represent hierarchically scaled features, node-attention module is further introduced to weight the relationship of SA nodes across objects and domains. For global alignment, an adversarial-training strategy is employed to learn and align global features across domains. Since there is no common evaluation benchmark for 3D point cloud DA scenario, we build a general benchmark (i.e., PointDA-10) extracted from three popular 3D object/scene datasets (i.e., ModelNet, ShapeNet and ScanNet) for cross-domain 3D objects classification fashion. 
Extensive experiments on PointDA-10 illustrate the superiority of our model over state-of-the-art general-purpose DA methods.", "full_text": "PointDAN: A Multi-Scale 3D Domain Adaption\n\nNetwork for Point Cloud Representation\n\n1Can Qin*, 2Haoxuan You*, 1Lichen Wang, 3C.-C. Jay Kuo, 1,4Yun Fu\n1Department of Electrical & Computer Engineering, Northeastern University\n\n2Department of Computer Science, Columbia University\n\n3Department of Electrical and Computer Engineering, University of Southern California\n\n4Khoury College of Computer Science, Northeastern University\nqin.ca@husky.neu.edu, haoxuan.you@columbia.edu,\nwanglichenxj@gmail.com, cckuo@sipi.usc.edu, yunfu@ece.neu.edu\n\nAbstract\n\nDomain Adaptation (DA) approaches have achieved significant improvements in a wide range of machine learning and computer vision tasks (e.g., classification, detection, and segmentation). However, as far as we are aware, few methods yet achieve domain adaptation directly on 3D point cloud data. The unique challenge of point cloud data lies in its abundant spatial geometric information, with the semantics of the whole object contributed by its regional geometric structures. Specifically, most general-purpose DA methods pursue global feature alignment while ignoring local geometric information, and are therefore not suitable for 3D domain alignment. In this paper, we propose a novel 3D Domain Adaptation Network for point cloud data (PointDAN). PointDAN jointly aligns global and local features at multiple levels. For local alignment, we propose a Self-Adaptive (SA) node module with an adjusted receptive field to model the discriminative local structures for aligning domains. To represent hierarchically scaled features, a node-attention module is further introduced to weight the relationships of SA nodes across objects and domains. 
For global alignment, an adversarial-training strategy is employed to learn and align global features across domains. Since there is no common evaluation benchmark for the 3D point cloud DA scenario, we build a general benchmark (i.e., PointDA-10) extracted from three popular 3D object/scene datasets (i.e., ModelNet, ShapeNet and ScanNet) for cross-domain 3D object classification. Extensive experiments on PointDA-10 illustrate the superiority of our model over state-of-the-art general-purpose DA methods.1\n\n1 Introduction\n\n3D vision has achieved promising outcomes in wide-ranging real-world applications (e.g., autonomous cars, robots, and surveillance systems). Enormous amounts of 3D point cloud data are captured by depth cameras or LiDAR sensors nowadays, and sophisticated 3D vision and machine learning algorithms are required to analyze their content for further exploitation. Recently, the advent of Deep Neural Networks (DNNs) has greatly boosted the performance of 3D vision understanding, including the tasks of classification, detection, and segmentation [22, 9, 37, 41]. Despite this impressive success, DNNs require massive amounts of labeled data for training, which are time-consuming and expensive to collect. This issue significantly limits their deployment in the real world.\nDomain adaptation (DA) solves this problem by building a model that utilizes the knowledge of a label-rich dataset, i.e., the source domain, and generalizes well on a label-scarce dataset, i.e., the target domain.\n\n1The PointDA-10 data and official code are uploaded on https://github.com/canqin001/PointDAN\n*Equal Contribution.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nFigure 1: Comparison between 2D-based and 3D-based DA approaches.\n\nHowever, due to the shifts of distribution across different domains/datasets, a model trained on one domain usually performs poorly on other domains. 
Most DA methods address this problem by either mapping original features into a shared subspace or minimizing instance-level distances, such as MMD and CORAL, to mix cross-domain features [2, 18, 31]. More recently, inspired by the Generative Adversarial Network (GAN) [12], adversarial-training DA methods such as DANN, ADDA, and MCD have achieved promising performance in DA and drawn increasing attention [10, 32, 26]. They deploy a zero-sum game between a discriminator and a generator to learn domain-invariant representations. However, most existing DA approaches mainly target 2D vision tasks and globally align the distribution shifts between different domains. For 3D point cloud data, in contrast, the geometric structures in 3D space can be described in detail, and different local structures also carry clear semantic meaning, such as legs for chairs, which in turn combine to form the global semantics of a whole object. As shown in Fig. 1, two 3D objects might be hard to align globally, yet have similar 3D local structures that are much easier to align. A domain adaptation framework that focuses on local geometric structures is therefore strongly desired in the 3D DA scenario.\nTo this end, this paper introduces a novel point-based Unsupervised Domain Adaptation Network (PointDAN) to achieve unsupervised domain adaptation (UDA) for 3D point cloud data. The key to our approach is to jointly align the multi-scale, i.e., global and local, features of point cloud data in an end-to-end manner. Specifically, Self-Adaptive (SA) nodes associated with an adjusted receptive field are proposed to dynamically gather and align local features across domains. Moreover, a node attention module is further designed to explore and interpret the relationships between nodes and their contributions to alignment. Meanwhile, an adversarial-training strategy is deployed to align the global features across domains. 
Since there have been few benchmarks for DA on 3D data (i.e., point clouds), we build a new benchmark named the PointDA-10 dataset for 3D vision DA. It is generated by selecting the samples of 10 overlapping categories among three popular datasets (i.e., ModelNet [35], ShapeNet [3] and ScanNet [5]). In all, the contributions of our paper can be summarized as follows:\n\n• We introduce a novel 3D-point-based unsupervised domain adaptation method that locally and globally aligns the distributions of 3D objects across different domains.\n• For local feature alignment, we propose Self-Adaptive (SA) nodes with a node attention to utilize local geometric information and dynamically gather regional structures for aligning local distributions across different domains.\n• We collect a new 3D point cloud DA benchmark, named the PointDA-10 dataset, for fair evaluation of 3D DA methods. Extensive experiments on PointDA-10 demonstrate the superiority of our model over state-of-the-art general-purpose DA methods.\n\n2 Related Works\n\n2.1 3D Vision Understanding\n\nDifferent from 2D vision, 3D vision has various data representation modalities: multi-view, voxel grid, 3D mesh and point cloud data. Deep networks have been employed to deal with these different formats of 3D data [29, 19, 36, 8]. Among these modalities, a point cloud, represented by a set of points with 3D coordinates {x, y, z}, is the most straightforward representation that preserves 3D spatial information. Point clouds can be directly obtained by LiDAR sensors, which enables many 3D environment understanding applications, from scene segmentation to autonomous driving. PointNet [22] is the first deep neural network to directly process point clouds; it proposes a symmetry function and a spatial transform network to obtain invariance to point permutation. 
Figure 2: Illustration of PointDAN, which mainly consists of local-level and global-level alignment.\n\nHowever, local geometric information, which is vital for describing objects in 3D space, is ignored by PointNet. Recent work therefore mainly focuses on how to effectively utilize local features. For instance, in PointNet++ [23], a series of PointNet structures are applied to local point sets of varied sizes, and local features are gathered in a hierarchical way. PointCNN [17] proposes the χ-Conv to aggregate features in local patches and applies a bottom-up network structure like typical CNNs. In 3D object detection tasks, [41] proposes to divide a large scene into many voxels, where the features of inside points are extracted respectively, and a 3D Region Proposal Network (RPN) structure follows to obtain detection predictions.\nIn spite of this broad usage, point cloud data has significant drawbacks in labeling efficiency. During labeling, people need to rotate an object several times and look through different angles to identify it. In real-world environments where point clouds are scanned by LiDAR, it also happens that some parts are lost or occluded (e.g., tables losing legs), which makes efficient labeling even more difficult. Under these circumstances, a specific 3D point-based unsupervised domain adaptation method designed to mitigate the domain gap between source labeled data and target unlabeled data is strongly desired.\n\n2.2 Unsupervised Domain Adaptation (UDA)\n\nThe main challenge of UDA is that a distribution shift (i.e., domain gap) exists between the target and source domains. It violates the basic assumption of conventional machine learning algorithms that training samples and test samples share the same distribution. 
To bridge the domain gap, UDA approaches match either the marginal distributions [30, 21, 11, 33] or the conditional distributions [38, 4] between domains via feature alignment. They address this problem by learning a mapping function f which projects raw image features into a feature space shared across domains. Most of them attempt to maximize the inter-class discrepancy while simultaneously minimizing the intra-class distance in a subspace. Various methods based on, e.g., Correlation Alignment (CORAL) [31], Maximum Mean Discrepancy (MMD) [2, 18], or the Geodesic distance [13] have been proposed.\nApart from the methods aforementioned, many DNN-based domain adaptation methods have been proposed due to their great capacity in representation learning [14, 28, 16]. The key to these methods is to apply a DNN to learn domain-invariant features through an end-to-end training scenario. Another kind of approach utilizes an adversarial training strategy to obtain domain-invariant representations [10, 32, 7, 24]. It includes a discriminator and a generator, where the generator aims to fool the discriminator until the discriminator is unable to distinguish the generated features between the two domains. Such approaches include Adversarial Discriminative Domain Adaptation (ADDA) [32], the Domain Adversarial Neural Network (DANN) [10], and Maximum Classifier Discrepancy (MCD) [26].\nMost UDA methods are designed for 2D vision tasks and focus on the alignment of global image features across different domains. In 3D data analytical tasks, however, regional and local geometric information is crucial for achieving good learning performance. Zhou et al. [40] first introduced UDA to the task of 3D keypoint estimation, relying on the regularization of a multi-view consistency term. However, this method cannot be extended to more general tasks, e.g., classification. 
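As a concrete illustration of this family of statistics-matching methods, the CORAL distance aligns the second-order statistics (covariances) of the two feature sets. A minimal, framework-free sketch (our own illustration on nested-list features, not any paper's released code):

```python
def covariance(feats):
    """Unbiased d x d covariance of an n x d feature matrix (nested lists)."""
    n, d = len(feats), len(feats[0])
    mean = [sum(row[j] for row in feats) / n for j in range(d)]
    cov = [[0.0] * d for _ in range(d)]
    for row in feats:
        diff = [row[j] - mean[j] for j in range(d)]
        for a in range(d):
            for b in range(d):
                cov[a][b] += diff[a] * diff[b] / (n - 1)
    return cov

def coral_distance(source, target):
    """CORAL distance: squared Frobenius norm of the covariance gap,
    scaled by 1 / (4 d^2)."""
    d = len(source[0])
    cs, ct = covariance(source), covariance(target)
    gap = sum((cs[a][b] - ct[a][b]) ** 2 for a in range(d) for b in range(d))
    return gap / (4 * d * d)
```

Minimizing this quantity over a learned feature mapping pulls the two domains' feature covariances together; identical feature sets yield a distance of exactly zero.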
In [27, 34], point cloud data are first projected into 2D images (bird's-eye view or front view) and 2D DA methods are applied, which loses essential 3D geometric information. To this end, we propose a generalized 3D point-based UDA framework. It well preserves the local structures and explores the global correlations of all local features. Adversarial training strategies are further employed to locally and globally align the distribution shifts across the source and target domains.\n\n3 Proposed Model\n\n3.1 Problem Definition and Notation\n\nIn 3D point-based UDA, we have access to a labeled source domain S = {(x_i^s, y_i^s)}_{i=1}^{n_s}, where y_i^s ∈ Y = {1, ..., Y}, with n_s annotated pairs, and a target domain T = {x_j^t}_{j=1}^{n_t} of n_t unlabeled data points. The inputs are point cloud data, usually represented by 3-dimensional coordinates (x, y, z), where x_i^s, x_j^t ∈ X ⊂ R^{T×3} and T is the number of points sampled from one 3D object, with the same label space Y_s = Y_t. It is further assumed that the two domains are sampled from the distributions P_s(x_i^s, y_i^s) and P_t(x_i^t, y_i^t) respectively, while the i.i.d. assumption is violated due to the distribution shift P_s ≠ P_t. The key to UDA is to learn a mapping function Φ : X → R^d that projects raw inputs into a shared feature space H spreadable for cross-domain samples.\n\n3.2 Local Feature Alignment\n\nLocal geometric information plays an important role in describing point cloud objects as well as in domain alignment. As illustrated in Fig. 1, given the same "table" class, the instance from ScanNet misses parts of its legs due to occlusion during LiDAR scanning. The key to aligning these two "tables" is to extract and match the features of similar structures, i.e., planes, while ignoring the differing parts. To utilize local geometric information, we propose to adaptively select and update key nodes for better fitting the local alignment.\nSelf-Adaptive Node Construction: Here we give the definition of a node in a point cloud. For each point cloud, we represent its n local geometric structures as n point sets {S_c | S_c = {x̂_c, x_c1, ..., x_ck}, x ⊂ R^3}_{c=1}^{n}, where the c-th region S_c contains a node x̂_c and its k nearest neighbor points {x_c1, ..., x_ck}. The location of a node decides where the local region is and which points are included. To obtain local features, previous work commonly places center nodes by farthest point sampling or random sampling [23, 17]. These methods guarantee full coverage of the whole point cloud. For domain alignment, however, it is essential to make sure that these nodes cover structures with common characteristics in 3D geometric space and drop the parts unique to certain objects. 
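The two sampling ingredients mentioned above, farthest point sampling for node placement and k-nearest-neighbor gathering for region membership, can be sketched in plain Python (a simplified illustration, not the paper's PyTorch implementation):

```python
import math

def farthest_point_sampling(points, n):
    """Greedily pick n node locations spread over the cloud: each new node
    is the point farthest from all nodes chosen so far."""
    nodes = [points[0]]
    while len(nodes) < n:
        nodes.append(max(points, key=lambda p: min(math.dist(p, c) for c in nodes)))
    return nodes

def knn_region(points, node, k):
    """Region S_c: the k points of the cloud nearest to the given node."""
    return sorted(points, key=lambda p: math.dist(p, node))[:k]
```

On a toy cloud with two well-separated clusters, two farthest-spread nodes land in opposite clusters, and each node's kNN region recovers its own cluster.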
In this way, the local regions sharing similar structures are better suited for alignment, while the uncommon parts would cause negative transfer.\nInspired by deformable convolution in 2D vision [6], we propose a novel geometric-guided shift learning module, which makes the receptive fields of the input nodes self-adaptive for the network. Different from deformable convolution, where semantic features are used to predict offsets, we utilize the local edge vectors as guidance during learning. As shown in Fig. 2, our module transforms the semantic information of each edge into a weight and then aggregates the weighted edge vectors to obtain the predicted offset direction. Intuitively, the predicted shift is decided by the voting of the surrounding edges with different significance. We first initialize the node locations by farthest point sampling over the point cloud to get n nodes, and their k nearest neighbor points are collected to form n regions. For the c-th node, its offset is computed as:\n\nΔx̂_c = (1/k) Σ_{j=1}^{k} (R_T(v_cj − v̂_c) · (x_cj − x̂_c)),   (1)\n\nwhere x̂_c and x_cj denote the locations of the node and its neighbor point, so x_cj − x̂_c is the edge direction. v_cj and v̂_c are their mid-level point features extracted from the encoder v = E(x|Θ_E), and R_T is the weight of one convolution layer for transforming features. We apply the bottom three feature extraction layers of PointNet as the encoder E. 
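Eq. (1) is a significance-weighted vote of the k edge vectors around a node. A minimal sketch of that computation (the learned 1 × 1 convolution R_T is replaced here by a caller-supplied scalar weight function, so this is an illustration rather than the actual module):

```python
def predict_offset(node_xyz, node_feat, neigh_xyz, neigh_feats, weight_fn):
    """Geometric-guided shift: average the edge vectors (x_cj - node),
    each scaled by a weight computed from the feature difference
    (v_cj - node_feat); weight_fn stands in for the learned transform R_T."""
    k = len(neigh_xyz)
    offset = [0.0, 0.0, 0.0]
    for xyz, feat in zip(neigh_xyz, neigh_feats):
        w = weight_fn([f - g for f, g in zip(feat, node_feat)])
        for i in range(3):
            offset[i] += w * (xyz[i] - node_xyz[i])
    return [o / k for o in offset]
```

With a constant unit weight, the node simply drifts toward the centroid of its neighbors; the learned weights instead let semantically significant edges dominate the vote.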
Δx̂_c is the predicted location offset of the c-th node.\nAfter obtaining the learned shift Δx̂_c, we achieve the self-adaptive update of the nodes and their regions by adding the shift back to the node x̂_c and finding its k nearest neighbor points:\n\nx̂_c = x̂_c + Δx̂_c,   (2)\n\n{x_c1, ..., x_ck} = kNN(x̂_c | x_j, j = 0, ..., M − 1).   (3)\n\nThen the final node feature v̂_c is computed by gathering all the point features inside the region:\n\nv̂_c = max_{j=1,..,k} R_G(v_cj),   (4)\n\nwhere R_G is the weight of one convolution layer for gathering point features, with R_G ∪ R_T = R, and the output node features are employed for local alignment. For better engaging the SA node features, we also interpolate them back onto each point following the interpolation strategy in [23] and fuse them with the original point features through a skip connection. The fused feature is input into the next-stage generator for higher-level processing.\nSA Node Attention: Even with SA nodes, it is unreasonable to assume that every SA node contributes equally to the goal of domain alignment. An attention module, designed to model the relationships between nodes, is necessary for weighting the contributions of different SA nodes to domain alignment and for capturing features at larger spatial scales. Inspired by channel attention [39], we apply a node attention network to model the contribution of each SA node to alignment by introducing a bottleneck network with a residual structure [14]:\n\nh_c = φ(W_U δ(W_D z_c)) · v̂_c + v̂_c,   (5)\n\nwhere z_c = E(v̂_c(k)) indicates the mean of the c-th node feature, and δ(·) and φ(·) represent the ReLU function [20] and the Sigmoid function respectively. W_D is the weight set of a convolutional layer with 1 × 1 kernels, which reduces the number of channels by the ratio r. The channel-upscaling layer W_U, with W_U ∪ W_D = W, increases the channels back to the original number by the ratio r.\nSA Node Feature Alignment: The optimization of both the offsets and the network parameters for local alignment is sensitive to the disturbance of gradients, which makes GAN-based methods unstable here. Therefore, we minimize the MMD [2, 18] loss to align cross-domain SA node features as:\n\nL_mmd = (1/n_s²) Σ_{i,j=1}^{n_s} κ(h_i^s, h_j^s) − (2/(n_s n_t)) Σ_{i,j=1}^{n_s,n_t} κ(h_i^s, h_j^t) + (1/n_t²) Σ_{i,j=1}^{n_t} κ(h_i^t, h_j^t),   (6)\n\nwhere κ is a kernel function; we apply the Radial Basis Function (RBF) kernel in our model.\n\n3.3 Global Feature Alignment\n\nGiven the feature f_i ∈ R^d of the i-th sample produced by a generator network, global feature alignment attempts to minimize the distance between features across different domains. In contrast to local feature alignment, the global alignment process is more stable because the receptive field of its inputs is fixed, which leaves more room for choosing GAN-based methods. In this paper, we apply Maximum Classifier Discrepancy (MCD) [26] for global feature alignment due to its outstanding performance in general-purpose domain alignment.\nThe encoder E designed for SA node feature extraction is also applied to extract raw point cloud features over the whole object: h̃_i = E(x_i|Θ_E). The point features are concatenated with the interpolated SA-node features as ĥ_i = [h_i, h̃_i] to capture the geometric information at multiple scales. 
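The node-feature MMD above can be computed directly from kernel evaluations; a framework-free sketch using the standard biased estimator of squared MMD with an RBF kernel (an illustration, not the training code):

```python
import math

def rbf(x, y, gamma=1.0):
    """RBF kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def mmd2(hs, ht, gamma=1.0):
    """Biased squared-MMD estimator: mean k(s,s) - 2 mean k(s,t) + mean k(t,t)."""
    ns, nt = len(hs), len(ht)
    kss = sum(rbf(a, b, gamma) for a in hs for b in hs) / (ns * ns)
    kst = sum(rbf(a, b, gamma) for a in hs for b in ht) / (ns * nt)
    ktt = sum(rbf(a, b, gamma) for a in ht for b in ht) / (nt * nt)
    return kss - 2.0 * kst + ktt
```

Identical source and target feature sets give a value of zero, while well-separated sets give a large positive value, which is what makes the quantity usable as an alignment loss.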
Then, we feed ĥ_i to the generator network G, which is the final convolution layer (i.e., conv4) of PointNet attached with a global max-pooling, to obtain the high-level global feature f_i = max-pooling(G(ĥ_i|Θ_G)), where f_i ∈ R^d represents the global feature of the i-th sample and d is typically set to 1,024. The global alignment module attempts to align the domains with two classifier networks F_1 and F_2 that keep the features discriminative given the support of the source domain decision boundaries. The two classifiers F_1 and F_2 take the feature f_i and classify it into K classes as p_j(y_i|x_i) = F_j(f_i|Θ_F^j), j = 1, 2, where p_j(y_i|x_i) is the K-dimensional probabilistic softmax output of classifier j.\nTo train the model, the total loss is composed of two parts: the task loss and the discrepancy loss. As in most UDA methods, the objective of the task loss is to minimize the empirical risk on the source domain {X_s, Y_s}, which is formulated as follows:\n\nL_cls(X_s, Y_s) = −E_{(x_s,y_s)~(X_s,Y_s)} Σ_{k=1}^{K} 1[k=y_s] log(p((y = y_s)|G(E(x_s|Θ_E)|Θ_G))).   (7)\n\nThe discrepancy loss is calculated as the l1 distance between the softmax scores of the two classifiers:\n\nL_dis(x_t) = E_{x_t~X_t}[|p_1(y|x_t) − p_2(y|x_t)|].   (8)\n\n3.4 Training Procedure\n\nWe apply Back-Propagation [25] to optimize the whole framework in an end-to-end training scenario. The training process is composed of two steps:\nStep 1. First, we train the two classifiers F_1 and F_2 with the discrepancy loss L_dis in Eq. (8) and the classification loss L_cls in Eq. (7). The discrepancy loss, which is to be maximized, helps gather target features given the support of the source domain, while the classification loss minimizes the empirical risk on the source domain. The objective function is:\n\nmin_{F_1,F_2} L_cls − λL_dis.   (9)\n\nStep 2. In this step, we train the generator G, the encoder E, the node attention network W and the transform network R by minimizing the discrepancy loss, classification loss and MMD loss to achieve discriminative and domain-invariant features. The objective function in this step is formulated as:\n\nmin_{G,E,W,R} L_cls + λL_dis + βL_mmd,   (10)\n\nwhere both λ and β are hyper-parameters, manually set to 1.\n\n3.5 Theoretical Analysis\n\nIn this section, we analyze our method in terms of the H∆H-distance theory [1]. The H∆H-distance is defined as\n\nd_{H∆H}(S,T) = 2 sup_{h_1,h_2∈H} |P_{x~S}[h_1(x) ≠ h_2(x)] − P_{x~T}[h_1(x) ≠ h_2(x)]|,   (11)\n\nwhich represents the discrepancy between the target and source distributions, T and S, with regard to the hypothesis class H. According to [1], the error of a classifier h on the target domain, ε_T(h), can be bounded by the sum of the source domain error ε_S(h), the H∆H-distance and a constant C which is independent of h, i.e.,\n\nε_T(h) ≤ ε_S(h) + (1/2) d_{H∆H}(S,T) + C.   (12)\n\nThe relationship between our method and the H∆H-distance is discussed in the following. The H∆H-distance can also be written as:\n\nd_{H∆H}(S,T) = 2 sup_{h_1,h_2∈H} |E_{x~S} 1[h_1(x)≠h_2(x)] − E_{x~T} 1[h_1(x)≠h_2(x)]|.   (13)\n\nThe term E_{x~S} 1[h_1(x)≠h_2(x)] is very small if h_1 and h_2 classify samples over S correctly. In our case, p_1 and p_2 correspond to h_1 and h_2 respectively, which agree in their predictions on source samples S. As a result, d_{H∆H}(S,T) can be approximately calculated by sup_{h_1,h_2∈H} E_{x~T} 1[h_1(x)≠h_2(x)], which is the supremum of L_dis in our problem. 
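The quantity driving both training steps, the discrepancy L_dis of Eq. (8), reduces per sample to an l1 gap between the two classifiers' softmax outputs. A toy sketch (the real model computes this on network outputs in PyTorch):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def discrepancy(p1, p2):
    """l1 distance between two classifiers' softmax outputs,
    averaged over the K classes (Eq. 8 for a single sample)."""
    return sum(abs(a - b) for a, b in zip(p1, p2)) / len(p1)
```

Step 1 raises this value on target samples to expose features falling outside the source decision boundaries; Step 2 lowers it by moving the generator's target features back inside, which is how the alternation approximates the min-max of problem (15).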
Decomposing the hypothesis h_1 into G and F_1, and h_2 into G and F_2, and fixing G, we get\n\nsup_{h_1,h_2∈H} E_{x~T} 1[h_1(x)≠h_2(x)] = sup_{F_1,F_2} E_{x~T} 1[F_1◦G(x)≠F_2◦G(x)].   (14)\n\nFurther, replacing sup with max, we attempt to minimize (14) with respect to G:\n\nmin_G max_{F_1,F_2} E_{x~T} 1[F_1◦G(x)≠F_2◦G(x)].   (15)\n\nProblem (15) is similar to problems (9) and (10) in our method. Considering the discrepancy loss L_dis, we first train the classifiers F_1, F_2 to maximize L_dis on the target domain and next train the generator G to minimize L_dis, which matches problem (15). Although we also need to consider the source loss L_cls and the MMD loss L_mmd, we can see from [1] that our method still has a close connection to the H∆H-distance. Thus, by iteratively training F_1, F_2 and G, we can effectively reduce d_{H∆H}(S,T), and further lead to a better approximation of ε_T(h) by ε_S(h).\n\nTable 1: Number of samples in the proposed datasets.\n\nDataset | Bathtub | Bed | Bookshelf | Cabinet | Chair | Lamp | Monitor | Plant | Sofa | Table | Total\nM Train | 106 | 515 | 572 | 200 | 889 | 124 | 465 | 240 | 680 | 392 | 4,183\nM Test | 50 | 100 | 100 | 86 | 100 | 20 | 100 | 100 | 100 | 100 | 856\nS Train | 599 | 167 | 310 | 1,076 | 4,612 | 1,620 | 762 | 158 | 2,198 | 5,876 | 17,378\nS Test | 85 | 23 | 50 | 126 | 662 | 232 | 112 | 30 | 330 | 842 | 2,492\nS* Train | 98 | 329 | 464 | 650 | 2,578 | 161 | 210 | 88 | 495 | 1,037 | 6,110\nS* Test | 26 | 85 | 146 | 149 | 801 | 41 | 61 | 25 | 134 | 301 | 1,769\n\n4 PointDA-10 Dataset\n\nAs there is no 3D point cloud benchmark designed for domain adaptation, we propose three datasets with different characteristics, i.e., ModelNet-10, ShapeNet-10 and ScanNet-10, for the evaluation of point cloud DA methods. To build them, we extract the samples of the 10 shared classes from ModelNet40 [35], ShapeNet [3] and ScanNet [5] respectively. 
The statistics and visualization are shown in Table 1 and Fig. 3. Given these three subdatasets, we organize six adaptation scenarios: M → S, M → S*, S → M, S → S*, S* → M and S* → S.\nModelNet-10 (M): ModelNet40 contains clean 3D CAD models of 40 categories. To extract the overlapping classes, we regard the 'nightstand' class in ModelNet40 as the 'cabinet' class in ModelNet-10, because these two objects almost share the same structure. After getting a CAD model, we sample points on its surface as in [23] to fully cover the object.\nShapeNet-10 (S): ShapeNetCore contains 3D CAD models of 55 categories gathered from online repositories. ShapeNet contains more samples, and its objects have a larger variance in structure compared with ModelNet. We apply uniform sampling to collect the points of ShapeNet on the surface, which, compared with ModelNet, may lose some marginal points.\nScanNet-10 (S*): ScanNet contains scanned and reconstructed real-world indoor scenes. We isolate the instances of the 10 classes contained in annotated bounding boxes for classification. The objects often lose some parts and get occluded by surroundings. ScanNet is a challenging but realistic domain.\n\nFigure 3: Samples of the PointDA-10 dataset.\n\n5 Experiments\n\n5.1 Experiments Setup\n\nIn this section, we evaluate the proposed method under the standard protocol [11] of unsupervised domain adaptation on the task of point cloud classification.\nImplementation Details: We choose PointNet [22] as the backbone of the encoder E and generator G, and apply a two-layer multilayer perceptron (MLP) as F_1 and F_2. The proposed approach is implemented in PyTorch with Adam [15] as the optimizer and an NVIDIA TITAN GPU for training. The learning rate is set to 0.0001 with a weight decay of 0.0005. All models are trained for 200 epochs with a batch size of 64. 
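The surface sampling used above to turn CAD models into point clouds can be sketched as area-weighted triangle picking plus a uniform barycentric draw; this is our own minimal illustration of the standard technique, not the benchmark-generation code:

```python
import math
import random

def triangle_area(a, b, c):
    """Half the magnitude of the cross product of two triangle edges."""
    ab = [b[i] - a[i] for i in range(3)]
    ac = [c[i] - a[i] for i in range(3)]
    cross = [ab[1] * ac[2] - ab[2] * ac[1],
             ab[2] * ac[0] - ab[0] * ac[2],
             ab[0] * ac[1] - ab[1] * ac[0]]
    return 0.5 * math.sqrt(sum(v * v for v in cross))

def sample_surface(triangles, n, seed=0):
    """Uniformly sample n points on a triangle mesh: pick faces with
    probability proportional to area, then a uniform point inside each."""
    rng = random.Random(seed)
    areas = [triangle_area(*t) for t in triangles]
    points = []
    for _ in range(n):
        a, b, c = rng.choices(triangles, weights=areas, k=1)[0]
        u, v = rng.random(), rng.random()
        if u + v > 1.0:          # reflect back into the triangle
            u, v = 1.0 - u, 1.0 - v
        points.append(tuple(a[i] + u * (b[i] - a[i]) + v * (c[i] - a[i])
                            for i in range(3)))
    return points
```

Sampling a two-triangle unit square this way yields points that stay on the square's plane, illustrating why the resulting clouds fully cover the object surface.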
We extract the SA node features from the third convolution layer (i.e., conv3) for local-level alignment, and the number of SA nodes is set to 64.\nBaselines: We compare the proposed method with a series of general-purpose UDA methods, including Maximum Mean Discrepancy (MMD) [18], Adversarial Discriminative Domain Adaptation (ADDA) [32], the Domain Adversarial Neural Network (DANN) [10], and Maximum Classifier Discrepancy (MCD) [26]. In these experiments, we use the same loss and the same training policy. w/o Adapt refers to the model trained only on source samples, and Supervised denotes the fully supervised method.\nAblation Study Setup: To analyze the effect of each module, we introduce an ablation study over four components: global feature alignment, i.e., G; local feature alignment, i.e., L; SA node attention, i.e., A; and self-training [42], i.e., P, which finetunes the model with 10% pseudo target labels generated from the target samples with the highest softmax scores.\nEvaluation: Given the labeled samples of the source domain and unlabeled samples of the target domain for training, all models are evaluated on the test set of the target domain. All experiments are repeated three times, and we report the average top-1 classification accuracy in all tables.\n\nTable 2: Quantitative classification results (%) on the PointDA-10 dataset.\n\nMethod | M→S | M→S* | S→M | S→S* | S*→M | S*→S | Avg\nw/o Adapt | 42.5 | 22.3 | 39.9 | 23.5 | 34.2 | 46.9 | 34.9\nMMD [18] | 57.5 | 27.9 | 40.7 | 26.7 | 47.3 | 54.8 | 42.5\nDANN [10] | 58.7 | 29.4 | 42.3 | 30.5 | 48.1 | 56.7 | 44.2\nADDA [32] | 61.0 | 30.5 | 40.4 | 29.3 | 48.9 | 51.1 | 43.5\nMCD [26] | 62.0 | 31.0 | 41.4 | 31.3 | 46.8 | 59.3 | 45.3\nOurs (G+L) | 62.5 | 31.2 | 41.5 | 31.5 | 46.9 | 59.3 | 45.5\nOurs (G+L+A) | 63.7 | 32.1 | 44.5 | 33.7 | 48.2 | 63.0 | 47.5\nOurs (G+L+A+P) | 64.2 | 33.0 | 47.6 | 33.9 | 49.1 | 64.1 | 48.7\nSupervised | 90.5 | 53.2 | 86.2 | 53.2 | 86.2 | 90.5 | 76.6\n\nTable 3: Class-wise classification results (%) on ModelNet to ShapeNet. (Only the class-average column is reliably recoverable from the extracted layout: w/o Adapt 37.3, MMD [18] 48.2, DANN [10] 48.6, ADDA [32] 48.3, MCD [26] 50.2, Ours (G+L) 50.6, Ours (G+L+A) 51.3, Ours (G+L+A+P) 51.0, Supervised 83.5.)\n\n5.2 Classification Results on the PointDA-10 Dataset\n\nThe quantitative results and comparison on the PointDA-10 dataset are summarized in Table 2. The proposed method outperforms all the general-purpose baselines on all adaptation scenarios. Although the largest domain gaps appear on M → S* and S → S*, ours exhibits large improvements there, which demonstrates its superiority in aligning different domains. Among the baseline methods, MMD, although defeated by GAN-based methods in 2D vision tasks, is only slightly inferior here and even outperforms them on some domain pairs. 
This phenomenon can be explained by the fact that global features limit the upper bound, due to their weakness in representing diversified geometric information. In addition, a large margin still remains between the supervised method and the DA methods.
Table 3 reports the class-wise classification results on the domain pair M → S. Local alignment helps boost the performance on most of the classes, especially Monitor and Chair. However, some objects, e.g., Sofa and Bed, are quite challenging to recognize under the UDA scenario, where negative transfer happens and the performance can drop on these classes. Moreover, we observed that imbalanced training samples do affect the performance of our model and the other domain adaptation (DA) models, which makes Table 3 slightly noisy. Chair, Table, and Sofa (easily confused with Bed) cover more than 60% of the samples in the M-to-S scenario, which causes the drop on certain classes (e.g., Bed and Sofa).

Figure 4: (a)-(b) Matched SA nodes for aligning cross-domain objects. (c) Analysis of different feature extraction layers for local feature alignment, and (d) convergence analysis.

5.3 Quantitative Analysis

Ablation Study: We further analyze the effect of the four components proposed in our model (i.e., G, L, A, P). From Table 2, we find that adding local alignment together with SA nodes brings a significant improvement, whereas local alignment with fixed nodes alone does not improve much. These results substantially validate the effectiveness of our SA nodes, which we attribute to their self-adaptive receptive fields and learned weights. An interesting phenomenon in Table 3 is that the full version is defeated by G+L+A in class-wise accuracy.
This indicates that the inference of pseudo labels is easily influenced by the imbalanced distribution of samples across classes, where certain classes dominate the self-training process and cause errors to accumulate.
Convergence: We evaluate the convergence of the proposed methods as well as the baseline methods on ModelNet-to-ShapeNet in Fig. 4(d). Compared with the baselines, local alignment helps accelerate convergence and makes training more stable after convergence.
SA Node Feature Extraction Layer: The influence of different layers for mid-level feature extraction is analyzed in Fig. 4(c) on M → S and S* → M. Compared with conv1 and conv2, whose features are less semantic, conv3 provides the best mid-level features for local alignment.

5.4 Results Visualization

We visualize the top-contributing SA nodes for the local alignment of two cross-domain objects to interpret the effectiveness of local feature alignment in Fig. 4(a)-4(b). The matched nodes are selected as the elements with the highest values in the matrix $M = h^s_i \times (h^t_j)^\top \in \mathbb{R}^{64 \times 64}$ obtained from Eq. 5. It is easily observed that the SA nodes representing similar geometric structures, e.g., legs and planes, contribute most to local alignment, whether they come from the same or different objects across domains. This clearly demonstrates the common knowledge learned by SA nodes for local alignment.

6 Conclusion

In this paper, we propose a novel 3D Unsupervised Domain Adaptation Network for Point Cloud Data (PointDAN). PointDAN is a specifically designed framework based on multi-scale feature alignment. For local feature alignment, we introduce Self-Adaptive (SA) nodes to represent common geometric structures across domains, and we apply a GAN-based method to align features globally. To evaluate the proposed model, we build a new 3D domain adaptation benchmark.
In the experiments, we have demonstrated the superiority of our approach over the state-of-the-art domain adaptation methods.

Acknowledgements

We thank Qianqian Ma from Boston University for her helpful theoretical insights and comments on our work.

References

[1] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan. A theory of learning from different domains. Machine Learning, 79(1-2):151–175, 2010.

[2] K. M. Borgwardt, A. Gretton, M. J. Rasch, H.-P. Kriegel, B. Schölkopf, and A. J. Smola. Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics, 22(14):e49–e57, 2006.

[3] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, et al. ShapeNet: An information-rich 3D model repository. arXiv preprint arXiv:1512.03012, 2015.

[4] N. Courty, R. Flamary, A. Habrard, and A. Rakotomamonjy. Joint distribution optimal transportation for domain adaptation. In Proceedings of the Advances in Neural Information Processing Systems, pages 3730–3739, 2017.

[5] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner. ScanNet: Richly-annotated 3D reconstructions of indoor scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5828–5839, 2017.

[6] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei. Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 764–773, 2017.

[7] J. Dong, Y. Cong, G. Sun, and D. Hou. Semantic-transferable weakly-supervised endoscopic lesions segmentation. In Proceedings of the IEEE International Conference on Computer Vision, 2019.

[8] Y. Feng, Y. Feng, H. You, X. Zhao, and Y. Gao.
MeshNet: Mesh neural network for 3D shape representation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 8279–8286, 2019.

[9] Y. Feng, Z. Zhang, X. Zhao, R. Ji, and Y. Gao. GVCNN: Group-view convolutional neural networks for 3D shape recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 264–272, 2018.

[10] Y. Ganin and V. Lempitsky. Unsupervised domain adaptation by backpropagation. arXiv preprint arXiv:1409.7495, 2014.

[11] B. Gong, K. Grauman, and F. Sha. Connecting the dots with landmarks: Discriminatively learning domain-invariant features for unsupervised domain adaptation. In Proceedings of the International Conference on Machine Learning, pages 222–230, 2013.

[12] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Proceedings of the Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

[13] R. Gopalan, R. Li, and R. Chellappa. Domain adaptation for object recognition: An unsupervised approach. In Proceedings of the IEEE International Conference on Computer Vision, pages 999–1006, 2011.

[14] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.

[15] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[16] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Proceedings of the Advances in Neural Information Processing Systems, 2012.

[17] Y. Li, R. Bu, M. Sun, W. Wu, X. Di, and B. Chen. PointCNN: Convolution on X-transformed points. In Proceedings of the Advances in Neural Information Processing Systems, pages 820–830, 2018.

[18] M. Long, J. Wang, G.
Ding, J. Sun, and P. S. Yu. Transfer feature learning with joint distribution adaptation. In Proceedings of the IEEE International Conference on Computer Vision, 2013.

[19] D. Maturana and S. Scherer. VoxNet: A 3D convolutional neural network for real-time object recognition. In Proceedings of the IEEE International Conference on Intelligent Robots and Systems, pages 922–928, 2015.

[20] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the International Conference on Machine Learning, pages 807–814, 2010.

[21] S. J. Pan, I. W. Tsang, J. T. Kwok, and Q. Yang. Domain adaptation via transfer component analysis. IEEE Transactions on Neural Networks, 22(2):199–210, 2010.

[22] C. R. Qi, H. Su, K. Mo, and L. J. Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 652–660, 2017.

[23] C. R. Qi, L. Yi, H. Su, and L. J. Guibas. PointNet++: Deep hierarchical feature learning on point sets in a metric space. In Proceedings of the Advances in Neural Information Processing Systems, pages 5099–5108, 2017.

[24] C. Qin, L. Wang, Y. Zhang, and Y. Fu. Generatively inferential co-training for unsupervised domain adaptation. In Proceedings of the IEEE International Conference on Computer Vision Workshops, 2019.

[25] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature, 323(6088):533, 1986.

[26] K. Saito, K. Watanabe, Y. Ushiku, and T. Harada. Maximum classifier discrepancy for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3723–3732, 2018.

[27] K. Saleh, A. Abobakr, M. Attia, J. Iskander, D. Nahavandi, and M. Hossny.
Domain adaptation for vehicle detection from bird's eye view LiDAR point cloud data. arXiv preprint arXiv:1905.08955, 2019.

[28] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[29] H. Su, S. Maji, E. Kalogerakis, and E. Learned-Miller. Multi-view convolutional neural networks for 3D shape recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 945–953, 2015.

[30] M. Sugiyama, S. Nakajima, H. Kashima, P. V. Buenau, and M. Kawanabe. Direct importance estimation with model selection and its application to covariate shift adaptation. In Proceedings of the Advances in Neural Information Processing Systems, pages 1433–1440, 2008.

[31] B. Sun and K. Saenko. Deep CORAL: Correlation alignment for deep domain adaptation. In Proceedings of the European Conference on Computer Vision, 2016.

[32] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell. Adversarial discriminative domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.

[33] L. Wang, Z. Ding, and Y. Fu. Low-rank transfer human motion segmentation. IEEE Transactions on Image Processing, 28(2):1023–1034, 2019.

[34] B. Wu, X. Zhou, S. Zhao, X. Yue, and K. Keutzer. SqueezeSegV2: Improved model structure and unsupervised domain adaptation for road-object segmentation from a LiDAR point cloud. In Proceedings of the International Conference on Robotics and Automation, pages 4376–4382, 2019.

[35] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao. 3D ShapeNets: A deep representation for volumetric shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1912–1920, 2015.

[36] H. You, Y. Feng, R. Ji, and Y. Gao. PVNet: A joint convolutional network of point cloud and multi-view for 3D shape recognition.
In Proceedings of the ACM International Conference on Multimedia, pages 1310–1318, 2018.

[37] H. You, Y. Feng, X. Zhao, C. Zou, R. Ji, and Y. Gao. PVRNet: Point-view relation neural network for 3D shape recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 9119–9126, 2019.

[38] K. Zhang, B. Schölkopf, K. Muandet, and Z. Wang. Domain adaptation under target and conditional shift. In Proceedings of the International Conference on Machine Learning, pages 819–827, 2013.

[39] Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong, and Y. Fu. Image super-resolution using very deep residual channel attention networks. In Proceedings of the European Conference on Computer Vision, 2018.

[40] X. Zhou, A. Karpur, C. Gan, L. Luo, and Q. Huang. Unsupervised domain adaptation for 3D keypoint estimation via view consistency. In Proceedings of the European Conference on Computer Vision, pages 137–153, 2018.

[41] Y. Zhou and O. Tuzel. VoxelNet: End-to-end learning for point cloud based 3D object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4490–4499, 2018.

[42] Y. Zou, Z. Yu, B. Vijaya Kumar, and J. Wang. Unsupervised domain adaptation for semantic segmentation via class-balanced self-training. In Proceedings of the European Conference on Computer Vision, pages 289–305, 2018.