{"title": "Population Matching Discrepancy and Applications in Deep Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 6262, "page_last": 6272, "abstract": "A differentiable estimation of the distance between two distributions based on samples is important for many deep learning tasks. One such estimation is maximum mean discrepancy (MMD). However, MMD suffers from its sensitive kernel bandwidth hyper-parameter, weak gradients, and large mini-batch size when used as a training objective. In this paper, we propose population matching discrepancy (PMD) for estimating the distribution distance based on samples, as well as an algorithm to learn the parameters of the distributions using PMD as an objective. PMD is defined as the minimum weight matching of sample populations from each distribution, and we prove that PMD is a strongly consistent estimator of the first Wasserstein metric. We apply PMD to two deep learning tasks, domain adaptation and generative modeling. Empirical results demonstrate that PMD overcomes the aforementioned drawbacks of MMD, and outperforms MMD on both tasks in terms of the performance as well as the convergence speed.", "full_text": "Population Matching Discrepancy and\n\nApplications in Deep Learning\n\nJianfei Chen, Chongxuan Li, Yizhong Ru, Jun Zhu\u2217\n\nDept. of Comp. Sci. & Tech., TNList Lab, State Key Lab for Intell. Tech. & Sys.\n\nTsinghua University, Beijing, 100084, China\n\n{chenjian14,licx14,ruyz13}@mails.tsinghua.edu.cn, dcszj@tsinghua.edu.cn\n\nAbstract\n\nA differentiable estimation of the distance between two distributions based on\nsamples is important for many deep learning tasks. One such estimation is maxi-\nmum mean discrepancy (MMD). However, MMD suffers from its sensitive kernel\nbandwidth hyper-parameter, weak gradients, and large mini-batch size when used\nas a training objective. 
In this paper, we propose population matching discrepancy (PMD) for estimating the distribution distance based on samples, as well as an algorithm to learn the parameters of the distributions using PMD as an objective. PMD is defined as the minimum weight matching of sample populations from each distribution, and we prove that PMD is a strongly consistent estimator of the first Wasserstein metric. We apply PMD to two deep learning tasks, domain adaptation and generative modeling. Empirical results demonstrate that PMD overcomes the aforementioned drawbacks of MMD, and outperforms MMD on both tasks in terms of performance as well as convergence speed.

1 Introduction

Recent advances in image classification [26], speech recognition [19] and machine translation [9] suggest that properly building large models with a deep hierarchy can be effective for solving realistic learning problems. Many deep learning tasks, such as generative modeling [16, 3], domain adaptation [5, 47], model criticism [32] and metric learning [14], require estimating the statistical divergence of two probability distributions. A challenge is that in many tasks, only samples, instead of closed-form distributions, are available. Such distributions include implicit probability distributions and intractable marginal distributions. Without making explicit assumptions on the parametric form, these distributions are richer and hence can lead to better estimates [35]. In these cases, estimating the statistical divergence from samples is important. Furthermore, as the distance can be used as a training objective, it needs to be differentiable with respect to the parameters of the distributions to enable efficient gradient-based training.

One popular sample-based statistical divergence is the maximum mean discrepancy (MMD) [17], which compares the kernel mean embeddings of two distributions in an RKHS.
MMD has a closed-form estimate of the statistical distance computable in quadratic time, and there are theoretical results bounding its approximation error. Due to its simplicity and theoretical guarantees, MMD has been widely adopted in many tasks such as belief propagation [44], domain adaptation [47] and generative modeling [31]. However, MMD has several drawbacks. For instance, it has a kernel bandwidth parameter that needs tuning [18], and the kernel can saturate so that the gradient vanishes [3] in a deep generative model. Furthermore, in order to have a reliable estimate of the distance, the mini-batch size must be large, e.g., 1000, which slows down training by stochastic gradient descent [31].

∗Corresponding author.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

Require: noise distributions q_X, q_Y; transformations T^X_{θ_X}, T^Y_{θ_Y}; population size N; mini-batch size |B|.
for each iteration do
    Draw ε ∼ q_X(·), ξ ∼ q_Y(·)
    Compute x_{i;θ_X} = T^X_{θ_X}(ε_i) and y_{j;θ_Y} = T^Y_{θ_Y}(ξ_j)
    M ← MinimumWeightMatching(X_{θ_X}, Y_{θ_Y})
    Align the matched pairs: y_{1;θ_Y}, . . . , y_{N;θ_Y} ← y_{M_1;θ_Y}, . . . , y_{M_N;θ_Y}
    for each mini-batch s ∈ {0, |B|, 2|B|, . . . , N − |B|} do
        θ ← SGD(θ, ∇_θ (1/|B|) Σ_{i=s}^{s+|B|−1} d(x_{i;θ_X}, y_{i;θ_Y}))
    end for
end for

Figure 1: Pseudocode of PMD for parameter learning with a graphical illustration of an iteration. Top: draw the populations and compute the matching; bottom: update the distribution parameters.

In this paper, we consider a sample-based estimation of the Wasserstein metric [49], which we refer to as population matching discrepancy (PMD).
PMD is the cost of the minimum weight matching between the two sample populations drawn from the distributions, and we show that it is a strongly consistent estimator of the first Wasserstein metric. We propose an algorithm that uses PMD as a training objective to learn the parameters of the distributions, and show that PMD has some advantages over MMD: PMD has no bandwidth hyper-parameter, has stronger gradients, and can use a normal mini-batch size, such as 100, during learning. We compare PMD with MMD on two deep learning tasks, domain adaptation and generative modeling. PMD outperforms MMD in terms of both performance and speed of convergence.

2 Population Matching Discrepancy

In this section, we give the definition of the population matching discrepancy (PMD) and propose an algorithm to learn with PMD.

2.1 Population Matching Discrepancy

Consider the general case where we have two distributions p_X(x) and p_Y(y), whose PDFs are unknown, but from which we are allowed to draw samples. Let X = {x_i}_{i=1}^N and Y = {y_j}_{j=1}^N denote the N i.i.d. samples from each distribution, respectively. We define the N-PMD of the two distributions as

D_N(X, Y) = min_M (1/N) Σ_{i=1}^N d(x_i, y_{M_i}),    (1)

where d(·, ·) is any distance in the sample space (e.g., Euclidean distance) and M is a permutation that derives a matching between the two sets of samples. The optimal M corresponds to the bipartite minimum weight matching [27], where each element of the cost matrix is d_ij = d(x_i, y_j) with i, j ∈ [N], where [N] = {1, · · · , N}. Intuitively, PMD is the average distance over the matched pairs of samples; therefore it is non-negative and symmetric. Furthermore, as we shall see in Sec.
3.1, PMD is a strongly consistent estimator of the first Wasserstein metric [49] between p_X and p_Y, which is a valid statistical distance, i.e., D_∞(X, Y) = 0 iff the two distributions p_X and p_Y are identical.

2.2 Parameter Learning

While the N-PMD in Eq. (1) can itself serve as a measure of the closeness of two distributions, we are more interested in learning the parameters of the distributions using PMD as an objective. For instance, in generative modeling [31], we have a parameterized generator distribution p_X(x; θ_X) and a data distribution p_Y(y), and we wish to minimize the distance between these two distributions. We assume the samples are obtained by applying parameterized transformations to known and fixed noise distributions, i.e.,

ε_i ∼ q_X(ε), x_{i;θ_X} = T^X_{θ_X}(ε_i);  and  ξ_j ∼ q_Y(ξ), y_{j;θ_Y} = T^Y_{θ_Y}(ξ_j).

For flexibility, the transformations can be implemented by deep neural networks. Without loss of generality, we assume both p_X and p_Y are distributions parameterized by θ_X and θ_Y, respectively. If p_X is a fixed distribution, we can take q_X = p_X and T^X_{θ_X} to be a fixed identity mapping. Our goal for parameter learning is to minimize the expected N-PMD over different populations

min_{θ_X, θ_Y} E_{ε,ξ} D_N(X_{θ_X}, Y_{θ_Y}),    (2)

where ε = {ε_i}_{i=1}^N, ξ = {ξ_j}_{j=1}^N, X_{θ_X} = {x_{i;θ_X}}_{i=1}^N and Y_{θ_Y} = {y_{j;θ_Y}}_{j=1}^N, and the expectation prevents over-fitting the parameters to particular populations. The parameters can be optimized by stochastic gradient descent (SGD) [7].
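Before turning to gradients, note that the N-PMD of Eq. (1) can itself be evaluated with an off-the-shelf assignment solver. The following is a minimal sketch (our own, not the paper's C++ implementation), using the L1 ground distance and SciPy's exact Hungarian solver:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def pmd(X, Y):
    """N-PMD of Eq. (1) with the L1 ground distance d(x, y) = ||x - y||_1.

    X and Y are (N, D) arrays, one row per sample.
    """
    # Cost matrix d_ij = d(x_i, y_j) of the bipartite matching problem.
    cost = np.abs(X[:, None, :] - Y[None, :, :]).sum(axis=2)
    # Exact minimum weight matching (Hungarian algorithm, O(N^3)).
    rows, cols = linear_sum_assignment(cost)
    return cost[rows, cols].mean()
```

Two copies of the same population give a PMD of zero, and translating one population by a constant vector c gives a PMD of exactly ‖c‖₁, since the identity matching is then optimal.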
At each iteration, we draw ε and ξ, and compute an unbiased stochastic gradient

∇_θ D_N(X_{θ_X}, Y_{θ_Y}) = ∇_θ min_M (1/N) Σ_{i=1}^N d(x_{i;θ_X}, y_{M_i;θ_Y}) = ∇_θ (1/N) Σ_{i=1}^N d(x_{i;θ_X}, y_{M*_i;θ_Y}),    (3)

where M* = argmin_M Σ_{i=1}^N d(x_{i;θ_X}, y_{M_i;θ_Y}) is the minimum weight matching for X_{θ_X} and Y_{θ_Y}. The second equality in Eq. (3) holds because the discrete matching M* does not change under an infinitesimal change of θ, as long as the transformations T^X, T^Y and the distance d(·, ·) are continuous. In other words, the gradient does not propagate through the matching.

Furthermore, assuming that the matching M* does not change much within a small number of gradient updates, we can obtain an even cheaper stochastic gradient by subsampling the populations

∇_θ D_N(X_{θ_X}, Y_{θ_Y}) ≈ ∇_θ (1/|B|) Σ_{i=1}^{|B|} d(x_{B_i;θ_X}, y_{M*_{B_i};θ_Y}),    (4)

where a mini-batch of |B| samples, e.g., 100, is used to approximate the whole N-sample population. To clarify, our population size N is what some maximum mean discrepancy (MMD) literature [31] calls the mini-batch size, and is around 1000. Fig. 1 shows the pseudocode of parameter learning for PMD along with a graphical illustration. In the outer loop, we generate populations and compute the matching; in the inner loop, we perform several SGD updates of the parameter θ, assuming the matching M does not change much.
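The outer/inner loop just described (and shown in Fig. 1) can be sketched end-to-end for a toy one-dimensional location-scale model x = μ + σε matched against a fixed target. All names and constants below are illustrative only, and the sign gradients come from choosing the L1 distance:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
N, B, lr = 256, 64, 0.05        # population size N, mini-batch size |B|, step size
mu, sigma = 0.0, 1.0            # parameters of p_X: x = mu + sigma * eps

for _ in range(200):
    # Outer loop: draw both populations and compute the matching.
    eps = rng.standard_normal(N)
    x = mu + sigma * eps                        # population from p_X
    y = 3.0 + 0.5 * rng.standard_normal(N)      # population from the fixed p_Y
    cost = np.abs(x[:, None] - y[None, :])      # L1 cost matrix
    rows, cols = linear_sum_assignment(cost)
    y = y[cols]                                 # align y with its matched x
    # Inner loop: mini-batch SGD steps with the matching held fixed (Eq. (4)).
    for s in range(0, N, B):
        g = np.sign(x[s:s + B] - y[s:s + B])    # gradient of |x - y| w.r.t. x
        mu -= lr * g.mean()                     # chain rule: dx/dmu = 1
        sigma -= lr * (g * eps[s:s + B]).mean() # chain rule: dx/dsigma = eps
```

Over the iterations (μ, σ) is driven toward the target's (3.0, 0.5); note that the inner loop reuses the samples and matching from the outer loop, which is exactly the approximation of Eq. (4).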
In the graphical illustration, the distribution p_Y is fixed, and we want to optimize the parameters of p_X to minimize their PMD.

2.3 Solving the Matching Problem

The minimum weight matching can be solved exactly in O(N³) time by the Hungarian algorithm [27]. When the problem is simple enough that a small N, e.g., hundreds, is sufficient for reliable distance estimation, the O(N³) time complexity is acceptable compared with the O(N × BackProp) time complexity of computing the gradient with respect to the transformations T^X_{θ_X} and T^Y_{θ_Y}. When N is larger, e.g., a few thousand, the Hungarian algorithm takes seconds to run. We then resort to Drake and Hougardy's approximated matching algorithm [11], which runs in O(N²) time. The running time and model quality of PMD using both matching algorithms are reported in Sec. 5.3. In practice, we find that PMD works well with both the exact and the approximate matching algorithm. This is not surprising, because training each sample towards its approximately matched sample is still reasonable. Finally, while we only implement serial CPU versions of the matching algorithms, both algorithms can be parallelized on GPU to further improve the running speed [10, 34].

3 Theoretical Analysis and Connections to Other Discrepancies

In this section, we establish the connections between PMD and both the Wasserstein metric and the maximum mean discrepancy (MMD). We show that PMD is a strongly consistent estimator of the Wasserstein metric, and compare its advantages and disadvantages with MMD.

3.1 Relationship with the Wasserstein Metric

The Wasserstein metric [49] was initially studied in optimal transport theory, and has been adopted in computer vision [40], information retrieval [50] and differential privacy [30].
The first Wasserstein metric of two distributions p_X(x) and p_Y(y) is defined as

inf_{γ(x,y)} ∫ d(x, y) γ(x, y) dx dy
s.t. ∫ γ(x, y) dx = p_Y(y), ∀y;  ∫ γ(x, y) dy = p_X(x), ∀x;  γ(x, y) ≥ 0, ∀x, y.    (5)

Intuitively, the Wasserstein metric is the optimal cost to move some mass distributed as p_X to p_Y, where the transference plan γ(x, y) is the amount of mass to move from x to y. Problem (5) is not tractable because the PDFs of p_X and p_Y are unknown. We approximate them with empirical distributions p̂_X(x) = (1/N) Σ_{i=1}^N δ_{x_i}(x) and p̂_Y(y) = (1/N) Σ_{j=1}^N δ_{y_j}(y), where δ_x(·) is the Dirac delta function at x. To satisfy the constraints, γ should have the form γ(x, y) = Σ_{i=1}^N Σ_{j=1}^N γ_ij δ_{x_i, y_j}(x, y), where γ_ij ≥ 0. Letting p_X = p̂_X and p_Y = p̂_Y, we can simplify problem (5) as follows

min_γ Σ_{i=1}^N Σ_{j=1}^N d(x_i, y_j) γ_ij
s.t. Σ_{j=1}^N γ_ij = 1/N, i ∈ [N];  Σ_{i=1}^N γ_ij = 1/N, j ∈ [N];  γ_ij ≥ 0.    (6)

The linear program (6) is equivalent to the minimum weight matching problem [27], i.e., there exists a permutation M_1, . . . , M_N such that γ(x_i, y_{M_i}) = 1/N is an optimal solution (see Proposition 5.4 in [6]). Plugging such a γ back into problem (6), we obtain Eq. (1), the original definition of PMD.

Furthermore, we can show that the solution of problem (6), i.e., the N-PMD, is a strongly consistent estimator of the first Wasserstein metric in problem (5).

Definition 1 (Weak Convergence of Measure [48]). A sequence of probability distributions p_N, N = 1, 2, . . . converges weakly to the probability distribution p, denoted as p_N ⇒ p, if lim_{N→∞} E_{p_N}[f] = E_p[f] for all bounded continuous functions f.

Proposition 3.1 (Varadarajan Theorem [48]). Let x_1, . . . , x_N, . . .
be independent, identically distributed real random variables with density function p(x), and let p_N(x) = (1/N) Σ_{i=1}^N δ_{x_i}(x), where δ_{x_i}(·) is the Dirac delta function. Then p_N ⇒ p almost surely.

Proposition 3.2 (Stability of Optimal Transport [49]). Let X and Y be Polish spaces and let d : X × Y → R be a continuous function s.t. inf d > −∞. Let {p^X_N}_{N∈N} and {p^Y_N}_{N∈N} be sequences of probability distributions on X and Y, respectively. Assume that p^X_N ⇒ p^X (resp. p^Y_N ⇒ p^Y). For each N, let γ_N be an optimal transference plan between p^X_N and p^Y_N. If lim inf_{N∈N} ∫ d(x, y) γ_N(x, y) dx dy < +∞, then γ_N ⇒ γ, where γ is an optimal transference plan between p^X and p^Y.

Proposition 3.2 is a special case of Theorem 5.20 in [49] with a fixed function d. The following theorem is the main result of this section.

Theorem 3.3 (Strong Consistency of PMD). Let x_1, . . . , x_N, . . . and y_1, . . . , y_N, . . . be independent, identically distributed real random variables from p_X and p_Y, respectively. We construct a sequence of PMD problems (6) between p^X_N(x) = (1/N) Σ_{i=1}^N δ_{x_i}(x) and p^Y_N(y) = (1/N) Σ_{j=1}^N δ_{y_j}(y). Let γ_N be the optimal transference plan of the N-th PMD problem. Then the sequence γ_N ⇒ γ almost surely, where γ is the optimal transference plan between p_X and p_Y. Moreover, lim_{N→∞} ∫ d(x, y) γ_N(x, y) dx dy = ∫ d(x, y) γ(x, y) dx dy almost surely.

The proof is straightforward by applying Propositions 3.1 and 3.2. We also perform an empirical study of the approximation error with respect to the population size in Fig. 2(a).

While the Wasserstein metric has been widely adopted in various machine learning and data mining tasks [40, 50, 30], it is usually used to measure the similarity between two discrete distributions, e.g., histograms.
In contrast, PMD is a stochastic approximation of the Wasserstein metric between two continuous distributions. There is also work on estimating the Wasserstein metric of continuous distributions based on samples [45]. Unlike PMD, which approximates the primal problem, they approximate the dual. Their approximation is not differentiable with respect to the distribution parameters, because the parameters appear in the constraints instead of the objective. Recently, Wasserstein GAN (WGAN) [3] proposed approximating the dual Wasserstein metric by using a neural network "critic" in place of a 1-Lipschitz function. While WGAN has shown excellent performance on generative modeling, it can only compute a relative value of the Wasserstein metric, up to an unknown scale factor depending on the Lipschitz constant of the critic network. PMD also differs from WGAN by not requiring a separate critic network with additional parameters. Instead, PMD is parameter free and can be computed in polynomial time.

(a) Relative approximation error w.r.t. the population size  (b) Distribution of normalized gradients
Figure 2: Some empirical analysis results. The detailed experiment setting is described in Sec. 5.4.

3.2 Relationship with MMD

Maximum mean discrepancy (MMD) [17] is a popular method for estimating the distance between two distributions from samples, defined as follows:

D_MMD(X, Y) = (1/N²) Σ_{i=1}^N Σ_{j=1}^N k(x_i, x_j) − (2/(N M)) Σ_{i=1}^N Σ_{j=1}^M k(x_i, y_j) + (1/M²) Σ_{i=1}^M Σ_{j=1}^M k(y_i, y_j),

where k(·, ·) is a kernel, e.g., k(x, y) = exp(−‖x − y‖² / (2σ²)) is the RBF kernel with bandwidth σ. Both MMD and the Wasserstein metric are integral probability metrics [17], with different function classes. MMD has a closed-form objective, and can be evaluated in O(N M D) time if x and y are D-dimensional vectors.
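For reference, the closed-form estimate above takes only a few lines; a sketch with a single RBF kernel (the bandwidth σ here is exactly the hyper-parameter PMD avoids):

```python
import numpy as np

def mmd2(X, Y, sigma=1.0):
    """Biased closed-form estimate of squared MMD with the RBF kernel
    k(x, y) = exp(-||x - y||^2 / (2 sigma^2)); X is (N, D), Y is (M, D)."""
    def k(A, B):
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-sq / (2.0 * sigma ** 2))
    # (1/N^2) sum_ij k(x_i, x_j) - (2/NM) sum_ij k(x_i, y_j)
    #   + (1/M^2) sum_ij k(y_i, y_j)
    return k(X, X).mean() - 2.0 * k(X, Y).mean() + k(Y, Y).mean()
```

The evaluation cost is O(N M D), as stated above, and this biased (V-statistic) form is always non-negative.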
In contrast, PMD needs to solve a matching problem: the time complexity is O(N²D) for computing the distance matrix, O(N³) for exact Hungarian matching, and O(N²) for approximated matching. However, as we argued in Sec. 2.3, the time complexity of computing the matching is still acceptable compared with the cost of training neural networks.

Compared with MMD, PMD has a number of advantages:

Fewer hyper-parameters PMD does not have the kernel bandwidth σ, which needs tuning.

Stronger gradients Using the RBF kernel, the gradient of MMD w.r.t. a particular sample x_i is

∇_{x_i} D_MMD(X, Y) = (1/N²) Σ_j k(x_i, x_j) (x_j − x_i)/σ² − (2/(N M)) Σ_j k(x_i, y_j) (y_j − x_i)/σ².

When minimizing MMD, the first term is a repulsive term between the samples from p_X, and the second term is an attractive term between the samples from p_X and p_Y. The L2 norm of the term between two samples x and y is k(x, y) ‖x − y‖₂ / σ², which is small if ‖x − y‖₂ is either too small or too large. As a result, if a sample x_i is an outlier, i.e., it is not close to any sample from p_Y, all the k(x_i, y_j) terms are small and x_i will not receive strong gradients. On the other hand, if all the samples x_i, i ∈ [N] are close to each other, x_j − x_i is small, so the repulsive term of the gradient is weak. Both cases slow down the training. In contrast, if d(x, y) = ‖x − y‖₁ is the L1 distance, the gradient of PMD, ∇_{x_i} D_N(X, Y) = (1/N) sgn(x_i − y_{M_i}), where sgn(·) is the elementwise sign function, is always strong regardless of the closeness between x_i and y_{M_i}. We compare the distribution of the relative magnitude of the gradient of the parameters contributed by each sample in Fig. 2(b).
The PMD gradients have similar magnitude for each sample, while for MMD many samples have small gradients.

Smaller mini-batch size As we saw in Sec. 2.2, the SGD mini-batch size for PMD can be smaller than the population size, while the mini-batch size for MMD must be equal to the population size. This is because PMD only considers the distance between a sample and its matched sample, while MMD considers the distances between all pairs of samples. As a result of the smaller mini-batch size, PMD can converge faster than MMD when used as a training objective.

4 Applications

4.1 Domain Adaptation

Now we consider a scenario where labeled data is scarce in some domain of interest (the target domain) but abundant in some related domain (the source domain). Assuming that the data distribution p_S(X, y) of the source domain and that of the target domain, i.e., p_T(X, y), are similar but not the same, unsupervised domain adaptation aims to train a model for the target domain, given some labeled data {(X^S_i, y^S_i)}_{i=1}^{N_S} from the source domain and some unlabeled data {X^T_j}_{j=1}^{N_T} from the target domain. According to domain adaptation theory [5], the generalization error on the target domain depends on the generalization error on the source domain as well as the difference between the two domains. Therefore, one possible solution for domain adaptation is to learn a feature extractor φ(X) shared by both domains, which defines feature distributions p^φ_S and p^φ_T for the two domains, and minimize some distance between the feature distributions [47] as a regularization.
Since the data distribution is inaccessible, we replace all distributions with their empirical distributions p̂_S, p̂_T, p̂^φ_S and p̂^φ_T, and the training objective is

E_{X,y∼p̂_S} L(y, h(φ(X))) + λ D(p̂^φ_S, p̂^φ_T),

where L(·, ·) is a loss function, h(·) is a classifier, λ is a hyper-parameter, and D(p̂^φ_S, p̂^φ_T) is the domain adaptation regularization. While the Wasserstein metric between two empirical distributions is itself tractable, it can be too expensive to compute due to the large size of the dataset. Therefore, we still approximate the distance with the (expected) PMD, i.e., D(p̂^φ_S, p̂^φ_T) ≈ E_{X_S∼p̂_S, X_T∼p̂_T} D_PMD(φ(X_S), φ(X_T)).

4.2 Deep Generative Modeling

Deep generative models (DGMs) aim at capturing the complex structures of the data by combining hierarchical architectures and probabilistic modelling. They have recently been proven effective on image generation [38] and semi-supervised learning [23]. There are many different DGMs, including tractable auto-regressive models [37], latent variable models [24, 39], and implicit probabilistic models [16, 31]. We focus on learning implicit probabilistic models, which define probability distributions on the sample space flexibly, without a closed form. Nevertheless, as described in Sec. 2.2, we can draw samples X = T^X_{θ_X}(ε) efficiently from the models by transforming a random noise ε ∼ q(ε), where q is a simple distribution (e.g., uniform), into X through a parameterized model (e.g., a neural network). The parameters of the models are trained to minimize some distance between the model distribution p_X(X) and the empirical data distribution p̂_Y(Y).
The distance can be defined based on a parameterized adversary, i.e., another neural network [16, 3], or directly from the samples [31]. We choose the distance to be the first Wasserstein metric, and directly employ its finite-sample estimator (i.e., the expected N-PMD defined in Eq. (2)) as the training objective. Training this model with MMD is known as generative moment matching networks [31, 12].

5 Experiments

We now study the empirical performance of PMD and compare it with MMD. In the experiments, PMD always uses the L1 distance, and MMD always uses the RBF kernel. Our experiments are conducted on a machine with an Nvidia Titan X (Pascal) GPU and an Intel E5-2683v3 CPU. We implement the models in TensorFlow [1]. The matching algorithms are implemented in C++ with a single thread, and we write a CUDA kernel for computing the all-pair L1 distances within a population. The CUDA program is compiled with nvcc 8.0 and the C++ program with g++ 4.8.4, with the -O3 flag used for both. We use the approximate matching for the generative modeling experiment and exact Hungarian matching for all the other experiments.

5.1 Domain Adaptation

We compare the performance of PMD and MMD on the standard Office [41] object recognition benchmark for domain adaptation. The dataset contains three domains: amazon, dslr and webcam, and

Table 1: Accuracy of all 6 unsupervised domain adaptation tasks on the Office dataset, between the amazon (a), dslr (d) and webcam (w) domains, in percentage.
SVM and NN are trained only on the source\ndomain, where NN uses the same architecture of PMD and MMD, but set \u03bb = 0.\n\nMethod\nDDC [47]\nDANN [13]\nCMD [52]\nJAN-xy [33]\nSVM\nNN\nMMD\nPMD\n\na \u2192 w\n59.4 \u00b1 .8\n\n73.0\n\n77.0 \u00b1 .6\n78.1 \u00b1 .4\n\n65.0\n\nd \u2192 w\n92.5 \u00b1 .3\n\n96.4\n\nw \u2192 d\n91.7 \u00b1 .8\n\n99.2\n\n96.3 \u00b1 .4\n99.2 \u00b1 .2\n96.4 \u00b1 .2 99.3 \u00b1 .1\n\n96.1\n\n99.4\n\n-\n-\n\n79.6 \u00b1 .6\n77.5 \u00b1 .2\n\n70.7\n\n96.3 \u00b1 .2\n67.8 \u00b1 .5\n76.9 \u00b1 .8\n96.2 \u00b1 .2\n86.2 \u00b1 .7 96.2 \u00b1 .3\n\n73.9 \u00b1 .6\n99.5 \u00b1 .2\n99.6 \u00b1 .2 78.4\u00b11.0\n99.5 \u00b1 .2\n\n58.5 \u00b1 .3\n64.9 \u00b1 .5\n82.7 \u00b1 .8 64.3 \u00b1 .4\n\na \u2192 d\n\nd \u2192 a\n\nw \u2192 a\n\navg.\n\n-\n-\n\n-\n-\n\n63.8 \u00b1 .7\n63.3 \u00b1 .6\n68.4 \u00b1 .2 65.0 \u00b1 .4\n\n56.4\n\n55.1\n\n58.1 \u00b1 .3\n68.1 \u00b1 .6\n66.8 \u00b1 .4\n\n-\n-\n\n79.9\n80.8\n73.8\n75.7\n80.7\n82.6\n\n(a) Convergence speed\n\n(b) MMD parameter sensitivity\n\n(c) PMD parameter sensitivity\n\nFigure 3: Convergence speed and parameter sensitivity on the Of\ufb01ce d \u2192 a task.\n\nthere are 31 classes. Following [52], we use the 4096-dimensional VGG-16 [43] feature pretrained\non ImageNet as the input. The classi\ufb01er is a fully-connected neural network with a single hidden\nlayer of 256 ReLU [15] units, trained with AdaDelta [51]. The domain regularization term is put on\nthe hidden layer. We apply batch normalization [21] on the hidden layer, and the activations from\nthe source and the target domain are normalized separately. Following [8], we validate the domain\nregularization strength \u03bb and the MMD kernel bandwidth \u03c3 on a random 100-sample labeled dataset\non the target domain, but the model is trained without any labeled data from the target domain. The\nexperiment is then repeated for 10 times on the hyper-parameters with the best validation error. 
Since we perform such validation for both PMD and MMD, the comparison between them is fair. The results are reported in Table 1: PMD outperforms MMD on the a → w and a → d tasks by a large margin, and is comparable with MMD on the other 4 tasks.

Then, we compare the convergence speed of PMD and MMD on the d → a task. We choose this task because PMD and MMD have similar performance on it. The result is shown in Fig. 3(a), where PMD converges faster than MMD. We also show the parameter sensitivity of MMD and PMD in Fig. 3(b) and Fig. 3(c), respectively. The performance of MMD is sensitive to both the regularization parameter λ and the kernel bandwidth σ, so we need to tune both parameters. In contrast, PMD only has one parameter to tune.

5.2 Generative Modeling

We compare PMD with MMD for image generation on the MNIST [28], SVHN [36] and LFW [20] datasets. For SVHN, we train the models on the 73257-image training set. The LFW dataset is converted to 32 × 32 gray-scale images [2], and there are 13233 images for training. The noise ε follows a uniform distribution on [−1, 1]^40. We implement three architectures: a fully-connected (fc) network as the transformation T^X_{θ_X}, a deconvolutional (conv) network, and a fully-connected network for generating auto-encoder codes (ae) [31], where the auto-encoder is a convolutional one pre-trained on the dataset. For MMD, we use a mixture of kernels of different bandwidths for the fc and conv architectures, and the bandwidth is fixed at 1 for the ae architecture, following the settings in the generative moment matching networks (GMMN) paper. We set the population size N = 2000 for both PMD and MMD, and the mini-batch size |B| = 100 for PMD. We use the AdaM optimizer [22] with batch normalization [21], and train the model for 100 epochs for PMD and 500 epochs for MMD.
fc   conv   ae   MMD   PMD
Figure 4: Image generation results on SVHN (top two rows) and LFW (bottom two rows).

(a) PMD sensitivity w.r.t. N and |B|  (b) Sensitivity of MMD w.r.t. N  (c) Split of the time per epoch
Figure 5: Convergence and timing results. The "Exact N = 500" curve in (a) uses the Hungarian algorithm, and the rest use the approximated matching algorithm.

The generated images on the SVHN and LFW datasets are presented in Fig. 4, and the images on the MNIST dataset can be found in the supplementary material. We observe that the images generated by PMD are less noisy than those generated by MMD. While MMD only performs well in the auto-encoder code space (ae), PMD generates acceptable images in pixel space. We also notice that the generated images of PMD on the SVHN and LFW datasets are blurry. One reason for this is that the pixel-level L1 distance is not a good metric for natural images; therefore, learning the generative model in the code space helps.
To verify that PMD does not trivially reproduce the training dataset, we perform a circular interpolation in the representation space q(ε) between 5 random points; the result is available in the supplementary material.

5.3 Convergence Speed and Time Consumption

We study the impact of the population size N, the mini-batch size |B| and the choice of matching algorithm on PMD. Fig. 5(a) shows the final PMD, evaluated on N = 2000 samples on the MNIST dataset using the fc architecture, after 100 epochs. The results show that the solution is insensitive to both the population size N and the choice of the matching algorithm, which implies that we can use the cheap approximated matching and a relatively small population size for speed. On the other hand, decreasing the mini-batch size |B| improves the final PMD significantly, supporting our claim in Sec. 3.2 that the ability to use a small |B| is indeed an advantage of PMD. Unlike PMD, there is a trade-off in selecting the population size N for MMD, as shown in Fig. 5(b). If N is too large, the SGD optimization converges slowly; if N is too small, the MMD estimation is unreliable. Fig. 5(c) shows the total time spent on exact matching, approximated matching and SGD, respectively, for each epoch. The cost of approximated matching is comparable with the cost of SGD. Again, we emphasize that while we only have single-thread implementations of the matching algorithms, both the exact [10] and the approximated matching [34] can be significantly accelerated on GPU.

5.4 Empirical Studies

We examine the approximation error of PMD on a toy dataset. We compute the distances between two 5-dimensional standard isotropic Gaussian distributions. One distribution is centered at the origin and the other at (10, 0, 0, 0, 0). The first Wasserstein metric between these two distributions is 10.
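Because the ground-truth value is known, this toy setting admits a quick numerical sanity check; a minimal sketch (our own, using SciPy's exact Hungarian solver and the L1 ground distance used throughout the experiments):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
shift = np.array([10.0, 0.0, 0.0, 0.0, 0.0])   # the true W1 distance is 10

def pmd_l1(X, Y):
    # N-PMD with the L1 ground distance (exact Hungarian matching).
    cost = np.abs(X[:, None, :] - Y[None, :, :]).sum(axis=2)
    rows, cols = linear_sum_assignment(cost)
    return cost[rows, cols].mean()

for N in (8, 64, 256):
    est = np.mean([pmd_l1(rng.standard_normal((N, 5)),
                          rng.standard_normal((N, 5)) + shift)
                   for _ in range(20)])
    print(N, abs(est - 10.0) / 10.0)   # relative error, decreasing with N
```

The small-N estimates overshoot 10 because the four nuisance dimensions add matching cost, and the excess shrinks as the population size grows, mirroring Fig. 2(a).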
We vary the population size N and compute the relative approximation error |D_N(X, Y) − 10|/10 over 100 different populations (X, Y) for each N. The result is shown in Fig. 2(a). We perform a linear regression between log N and the logarithm of the expected approximation error, and find that the error is roughly proportional to N^{-0.23}.

We also validate the claim in Sec. 3.2 that the gradients of PMD are stronger than those of MMD. We calculate the magnitude (in L2 norm) of the gradient of the parameters contributed by each sample. The gradients are computed on the converged model, which is the same as in Sec. 5.3. Because the scale of the gradients depends on the scale of the loss function, we normalize the magnitudes by dividing them by the average gradient magnitude. We then show the distribution of the normalized gradient magnitudes in Fig. 2(b). The PMD gradients contributed by each sample are close to each other, whereas for MMD many samples contribute only small gradients, which may slow down the fitting of those samples.

6 Conclusions

We present population matching discrepancy (PMD) for estimating the distance between two probability distributions from samples. PMD is the minimum weight matching between two random populations from the distributions, and we show that PMD is a strongly consistent estimator of the first Wasserstein metric. We also propose a stochastic gradient descent algorithm to learn the parameters of the distributions using PMD. Compared with the popular maximum mean discrepancy (MMD), PMD has no kernel bandwidth hyper-parameter, provides stronger gradients, and allows smaller mini-batch sizes for gradient-based optimization. 
We apply PMD to domain adaptation and generative modeling tasks. Empirical results show that PMD outperforms MMD in terms of both performance and convergence speed on the two tasks. In the future, we plan to derive finite-sample error bounds for PMD, study its testing power, and accelerate the computation of minimum weight matching with GPUs.

Acknowledgments

This work is supported by the National NSF of China (Nos. 61620106010, 61621136008, 61332007), the MIIT Grant of Int. Man. Comp. Stan (No. 2016ZXFB00001), the Youth Top-notch Talent Support Program, the Tsinghua Tiangong Institute for Intelligent Computing, and the NVIDIA NVAIL Program.

References

[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.

[2] Siddharth Agrawal. Generative Moment Matching Networks. https://github.com/siddharth-agrawal/Generative-Moment-Matching-Networks.

[3] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.

[4] Marc G Bellemare, Ivo Danihelka, Will Dabney, Shakir Mohamed, Balaji Lakshminarayanan, Stephan Hoyer, and Rémi Munos. The Cramer distance as a solution to biased Wasserstein gradients. arXiv preprint arXiv:1705.10743, 2017.

[5] Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains. Machine Learning, 79(1):151–175, 2010.

[6] Dimitri P Bertsekas. Network Optimization: Continuous and Discrete Models. Citeseer, 1998.

[7] Léon Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010, pages 177–186. Springer, 2010.

[8] Konstantinos Bousmalis, George Trigeorgis, Nathan Silberman, Dilip Krishnan, and Dumitru Erhan. Domain separation networks. In Advances in Neural Information Processing Systems, pages 343–351, 2016.

[9] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP, 2014.

[10] Ketan Date and Rakesh Nagi. GPU-accelerated Hungarian algorithms for the linear assignment problem. Parallel Computing, 57:52–72, 2016.

[11] Doratha Drake and Stefan Hougardy. Improved linear time approximation algorithms for weighted matchings. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pages 21–46, 2003.

[12] Gintare Karolina Dziugaite, Daniel M Roy, and Zoubin Ghahramani. Training generative neural networks via maximum mean discrepancy optimization. In UAI, 2015.

[13] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. Journal of Machine Learning Research, 17(59):1–35, 2016.

[14] Bo Geng, Dacheng Tao, and Chao Xu. DAML: Domain adaptation metric learning. IEEE Transactions on Image Processing, 20(10):2980–2989, 2011.

[15] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. In AISTATS, volume 15, page 275, 2011.

[16] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

[17] Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. 
Journal of Machine Learning Research, 13(Mar):723–773, 2012.

[18] Arthur Gretton, Dino Sejdinovic, Heiko Strathmann, Sivaraman Balakrishnan, Massimiliano Pontil, Kenji Fukumizu, and Bharath K Sriperumbudur. Optimal kernel choice for large-scale two-sample tests. In Advances in Neural Information Processing Systems, pages 1205–1213, 2012.

[19] Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, 2012.

[20] Gary B. Huang, Manu Ramesh, Tamara Berg, and Erik Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, October 2007.

[21] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.

[22] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[23] Diederik P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pages 3581–3589, 2014.

[24] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

[25] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. 2009.

[26] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

[27] Harold W Kuhn. The Hungarian method for the assignment problem. 
Naval Research Logistics Quarterly, 2(1-2):83–97, 1955.

[28] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[29] Chun-Liang Li, Wei-Cheng Chang, Yu Cheng, Yiming Yang, and Barnabás Póczos. MMD GAN: Towards deeper understanding of moment matching network. arXiv preprint arXiv:1705.08584, 2017.

[30] Ninghui Li, Tiancheng Li, and Suresh Venkatasubramanian. t-closeness: Privacy beyond k-anonymity and l-diversity. In Data Engineering, 2007. ICDE 2007. IEEE 23rd International Conference on, pages 106–115. IEEE, 2007.

[31] Yujia Li, Kevin Swersky, and Richard S Zemel. Generative moment matching networks. In ICML, pages 1718–1727, 2015.

[32] James R Lloyd and Zoubin Ghahramani. Statistical model criticism using kernel two sample tests. In Advances in Neural Information Processing Systems, pages 829–837, 2015.

[33] Mingsheng Long, Jianmin Wang, and Michael I Jordan. Deep transfer learning with joint adaptation networks. In ICML, 2017.

[34] Fredrik Manne and Rob Bisseling. A parallel approximation algorithm for the weighted maximum matching problem. In Parallel Processing and Applied Mathematics, pages 708–717, 2008.

[35] Shakir Mohamed and Balaji Lakshminarayanan. Learning in implicit generative models. arXiv preprint arXiv:1610.03483, 2016.

[36] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, volume 2011, page 5, 2011.

[37] Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. In ICML, 2016.

[38] Alec Radford, Luke Metz, and Soumith Chintala. 
Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.

[39] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In ICML, 2014.

[40] Yossi Rubner, Carlo Tomasi, and Leonidas J Guibas. The earth mover's distance as a metric for image retrieval. International Journal of Computer Vision, 40(2):99–121, 2000.

[41] Kate Saenko, Brian Kulis, Mario Fritz, and Trevor Darrell. Adapting visual category models to new domains. In Computer Vision–ECCV 2010, pages 213–226, 2010.

[42] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pages 2234–2242, 2016.

[43] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.

[44] Le Song, Arthur Gretton, Danny Bickson, Yucheng Low, and Carlos Guestrin. Kernel belief propagation. In International Conference on Artificial Intelligence and Statistics, pages 707–715, 2011.

[45] Bharath K Sriperumbudur, Kenji Fukumizu, Arthur Gretton, Bernhard Schölkopf, and Gert RG Lanckriet. Non-parametric estimation of integral probability metrics. In Information Theory Proceedings (ISIT), 2010 IEEE International Symposium on, pages 1428–1432. IEEE, 2010.

[46] Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2):26–31, 2012.

[47] Eric Tzeng, Judy Hoffman, Ning Zhang, Kate Saenko, and Trevor Darrell. Deep domain confusion: Maximizing for domain invariance. arXiv preprint arXiv:1412.3474, 2014.

[48] VS Varadarajan. Weak convergence of measures on separable metric spaces. 
Sankhyā: The Indian Journal of Statistics (1933-1960), 19(1/2):15–22, 1958.

[49] Cédric Villani. Optimal Transport: Old and New, volume 338. Springer Science & Business Media, 2008.

[50] Xiaojun Wan. A novel document similarity measure based on earth mover's distance. Information Sciences, 177(18):3718–3730, 2007.

[51] Matthew D Zeiler. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.

[52] Werner Zellinger, Thomas Grubinger, Edwin Lughofer, Thomas Natschläger, and Susanne Saminger-Platz. Central moment discrepancy (CMD) for domain-invariant representation learning. In ICLR, 2017.