{"title": "Modeling Tabular data using Conditional GAN", "book": "Advances in Neural Information Processing Systems", "page_first": 7335, "page_last": 7345, "abstract": "Modeling the probability distribution of rows in tabular data and generating realistic synthetic data is a non-trivial task. Tabular data usually contains a mix of discrete and continuous columns. Continuous columns may have multiple modes whereas discrete columns are sometimes imbalanced making the modeling difficult. Existing statistical and deep neural network models fail to properly model this type of data. We design CTGAN, which uses a conditional generative adversarial network to address these challenges. To aid in a fair and thorough comparison, we design a benchmark with 7 simulated and 8 real datasets and several Bayesian network baselines. CTGAN outperforms Bayesian methods on most of the real datasets whereas other deep learning methods could not.", "full_text": "Modeling Tabular Data using Conditional GAN\n\nLei Xu\nMIT LIDS\n\nCambridge, MA\nleix@mit.edu\n\nMaria Skoularidou\n\nMRC-BSU, University of Cambridge\n\nCambridge, UK\n\nms2407@cam.ac.uk\n\nAlfredo Cuesta-Infante\n\nUniversidad Rey Juan Carlos\n\nM\u00f3stoles, Spain\n\nalfredo.cuesta@urjc.es\n\nKalyan Veeramachaneni\n\nMIT LIDS\n\nCambridge, MA\n\nkalyanv@mit.edu\n\nAbstract\n\nModeling the probability distribution of rows in tabular data and generating realistic\nsynthetic data is a non-trivial task. Tabular data usually contains a mix of discrete\nand continuous columns. Continuous columns may have multiple modes whereas\ndiscrete columns are sometimes imbalanced making the modeling dif\ufb01cult. Existing\nstatistical and deep neural network models fail to properly model this type of data.\nWe design CTGAN, which uses a conditional generator to address these challenges.\nTo aid in a fair and thorough comparison, we design a benchmark with 7 simulated\nand 8 real datasets and several Bayesian network baselines. 
CTGAN outperforms\nBayesian methods on most of the real datasets whereas other deep learning methods\ncould not.\n\n1\n\nIntroduction\n\nTable 1: The number of wins of a particular method\ncompared with the corresponding Bayesian network\nagainst an appropriate metric on 8 real datasets.\n\nRecent developments in deep generative mod-\nels have led to a wealth of possibilities. Us-\ning images and text, these models can learn\nprobability distributions and draw high-quality\nrealistic samples. Over the past two years,\nthe promise of such models has encouraged\nthe development of generative adversarial net-\nworks (GANs) [10] for tabular data genera-\ntion. GANs offer greater \ufb02exibility in model-\ning distributions than their statistical counter-\nparts. This proliferation of new GANs neces-\nsitates an evaluation mechanism. To evaluate\nthese GANs, we used a group of real datasets\nto set-up a benchmarking system and imple-\nmented three of the most recent techniques. For comparison purposes, we created two baseline\nmethods using Bayesian networks. After testing these models using both simulated and real datasets,\nwe found that modeling tabular data poses unique challenges for GANs, causing them to fall short\nof the baseline methods on a number of metrics such as likelihood \ufb01tness and machine learning\nef\ufb01cacy of the synthetically generated data. 
These challenges include the need to simultaneously model discrete and continuous columns, the multi-modal non-Gaussian values within each continuous column, and the severe imbalance of categorical columns (described in Section 3).

Method               Wins vs. CLBN [7]   Wins vs. PrivBN [28]
MedGAN, 2017 [6]     1                   1
VeeGAN, 2017 [21]    0                   2
TableGAN, 2018 [18]  3                   3
CTGAN                7                   8

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

To address these challenges, in this paper, we propose conditional tabular GAN (CTGAN)1, a method which introduces several new techniques: augmenting the training procedure with mode-specific normalization, architectural changes, and addressing data imbalance by employing a conditional generator and training-by-sampling (described in Section 4). When applied to the same datasets with the benchmarking suite, CTGAN performs significantly better than both the Bayesian network baselines and the other GANs tested, as shown in Table 1.
The contributions of this paper are as follows:
(1) Conditional GANs for synthetic data generation. We propose CTGAN as a synthetic tabular data generator to address several issues mentioned above. CTGAN outperforms all methods to date and surpasses Bayesian networks on at least 87.5% of our datasets. To further challenge CTGAN, we adapt a variational autoencoder (VAE) [15] for mixed-type tabular data generation. We call this TVAE. VAEs directly use data to build the generator; even with this advantage, we show that our proposed CTGAN achieves competitive performance across many datasets and outperforms TVAE on 3 datasets.
(2) A benchmarking system for synthetic data generation algorithms.2 We designed a comprehensive benchmark framework using several tabular datasets and different evaluation metrics as well as implementations of several baselines and state-of-the-art methods. 
Our system is open source\nand can be extended with other methods and additional datasets. At the time of this writing, the\nbenchmark has 5 deep learning methods, 2 Bayesian network methods, 15 datasets, and 2 evaluation\nmechanisms.\n\n2 Related Work\n\nDuring the past decade, synthetic data has been generated by treating each column in a table as a\nrandom variable, modeling a joint multivariate probability distribution, and then sampling from that\ndistribution. For example, a set of discrete variables may have been modeled using decision trees\n[20] and Bayesian networks [2, 28]. Spatial data could be modeled with a spatial decomposition tree\n[8, 27]. A set of non-linearly correlated continuous variables could be modeled using copulas [19, 23].\nThese models are restricted by the type of distributions and by computational issues, severely limiting\nthe synthetic data\u2019s \ufb01delity.\nThe development of generative models using VAEs and, subsequently, GANs and their numerous\nextensions [1, 11, 29, 26], has been very appealing due to the performance and \ufb02exibility offered\nin representing data. GANs are also used in generating tabular data, especially healthcare records;\nfor example, [25] uses GANs to generate continuous time-series medical records and [4] proposes\nthe generation of discrete tabular data using GANs. medGAN [6] combines an auto-encoder and a\nGAN to generate heterogeneous non-time-series continuous and/or binary data. ehrGAN [5] generates\naugmented medical records. tableGAN [18] tries to solve the problem of generating synthetic data\nusing a convolutional neural network which optimizes the label column\u2019s quality; thus, generated\ndata can be used to train classi\ufb01ers. 
PATE-GAN [14] generates differentially private synthetic data.

3 Challenges with GANs in the Tabular Data Generation Task

The task of synthetic data generation requires training a data synthesizer G learned from a table T and then using G to generate a synthetic table Tsyn. A table T contains Nc continuous columns {C1, . . . , CNc} and Nd discrete columns {D1, . . . , DNd}, where each column is considered to be a random variable. These random variables follow an unknown joint distribution P(C1:Nc, D1:Nd). One row rj = {c1,j, . . . , cNc,j, d1,j, . . . , dNd,j}, j ∈ {1, . . . , n}, is one observation from the joint distribution. T is partitioned into a training set Ttrain and a test set Ttest. After training G on Ttrain, Tsyn is constructed by independently sampling rows using G. We evaluate the efficacy of a generator along two axes. (1) Likelihood fitness: do columns in Tsyn follow the same joint distribution as Ttrain? (2) Machine learning efficacy: when training a classifier or a regressor to predict one column using other columns as features, can such a classifier or regressor learned from Tsyn achieve similar performance on Ttest as a model learned on Ttrain?
Several unique properties of tabular data challenge the design of a GAN model.

1Our CTGAN model is open-sourced at https://github.com/DAI-Lab/CTGAN
2Our benchmark can be found at https://github.com/DAI-Lab/SDGym.

2

Mixed data types. Real-world tabular data consists of mixed types. To simultaneously generate a mix of discrete and continuous columns, GANs must apply both softmax and tanh on the output.
Non-Gaussian distributions. In images, pixels' values follow a Gaussian-like distribution, which can be normalized to [−1, 1] using a min-max transformation. A tanh function is usually employed in the last layer of a network to output a value in this range. 
Continuous values in tabular data are usually non-Gaussian, and applying a min-max transformation to them leads to a vanishing-gradient problem.
Multimodal distributions. We use kernel density estimation to estimate the number of modes in a column. We observe that 57/123 continuous columns in our 8 real-world datasets have multiple modes. Srivastava et al. [21] showed that a vanilla GAN couldn't model all modes on a simple 2D dataset; thus it would also struggle in modeling the multimodal distribution of continuous columns.
Learning from sparse one-hot-encoded vectors. When generating synthetic samples, a generative model is trained to generate a probability distribution over all categories using softmax, while the real data is represented as one-hot vectors. This is problematic because a trivial discriminator can simply distinguish real and fake data by checking the distribution's sparseness instead of considering the overall realness of a row.
Highly imbalanced categorical columns. In our datasets we noticed that 636/1048 of the categorical columns are highly imbalanced, in which the major category appears in more than 90% of the rows. This creates severe mode collapse. Missing a minor category causes only tiny changes to the data distribution, which are hard for the discriminator to detect. Imbalanced data also leads to insufficient training opportunities for minor classes.

4 CTGAN Model

CTGAN is a GAN-based method to model the tabular data distribution and sample rows from it. In CTGAN, we introduce mode-specific normalization to overcome the non-Gaussian and multimodal distributions (Section 4.2). We design a conditional generator and training-by-sampling to deal with the imbalanced discrete columns (Section 4.3). We use fully-connected networks and several recent techniques to train a high-quality model.

4.1 Notations

We define the following notations.

– x1 ⊕ x2 ⊕ . .
.: concatenate vectors x1, x2, . . .
– gumbelτ(x): apply Gumbel softmax [13] with parameter τ on a vector x
– leakyγ(x): apply a leaky ReLU activation on x with leaky ratio γ
– FCu→v(x): apply a linear transformation on a u-dim input to get a v-dim output.

We also use tanh, ReLU, softmax, BN for batch normalization [12], and drop for dropout [22].

4.2 Mode-specific Normalization

Properly representing the data is critical in training neural networks. Discrete values can naturally be represented as one-hot vectors, while representing continuous values with arbitrary distributions is non-trivial. Previous models [6, 18] use min-max normalization to normalize continuous values to [−1, 1]. In CTGAN, we design a mode-specific normalization to deal with columns with complicated distributions.
Figure 1 shows our mode-specific normalization for a continuous column. In our method, each column is processed independently. Each value is represented as a one-hot vector indicating the mode, and a scalar indicating the value within the mode. Our method contains three steps.

1. For each continuous column Ci, use a variational Gaussian mixture model (VGM) [3] to estimate the number of modes mi and fit a Gaussian mixture. For instance, in Figure 1, the VGM finds three modes (mi = 3), namely η1, η2 and η3. The learned Gaussian mixture is PCi(ci,j) = Σ_{k=1}^{3} µk N(ci,j; ηk, φk), where µk and φk are the weight and standard deviation of a mode respectively.

3

Figure 1: An example of mode-specific normalization.

2. For each value ci,j in Ci, compute the probability of ci,j coming from each mode. For instance, in Figure 1, the probability densities are ρ1, ρ2, ρ3. The probability densities are computed as ρk = µk N(ci,j; ηk, φk).

3. 
Sample one mode given the probability densities, and use the sampled mode to normalize the value. For example, in Figure 1, we pick the third mode given ρ1, ρ2 and ρ3. Then we represent ci,j as a one-hot vector βi,j = [0, 0, 1] indicating the third mode, and a scalar αi,j = (ci,j − η3)/(4φ3) to represent the value within the mode.

The representation of a row becomes the concatenation of continuous and discrete columns

rj = α1,j ⊕ β1,j ⊕ . . . ⊕ αNc,j ⊕ βNc,j ⊕ d1,j ⊕ . . . ⊕ dNd,j,

where di,j is the one-hot representation of a discrete value.

4.3 Conditional Generator and Training-by-Sampling

Traditionally, the generator in a GAN is fed with a vector sampled from a standard multivariate normal distribution (MVN). By training together with a discriminator or critic network, one eventually obtains a deterministic transformation that maps the standard MVN into the distribution of the data. This method of training a generator does not account for the imbalance in the categorical columns. If the training data are randomly sampled during training, the rows that fall into the minor category will not be sufficiently represented; thus the generator may not be trained correctly. If the training data are resampled, the generator learns the resampled distribution, which is different from the real data distribution. This problem is reminiscent of the "class imbalance" problem in discriminative modeling; the challenge, however, is exacerbated since there is not a single column to balance and the real data distribution should be kept intact.
Specifically, the goal is to resample efficiently in a way that all the categories from discrete attributes are sampled evenly (but not necessarily uniformly) during the training process, and to recover the (not-resampled) real data distribution during test. 
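Looking back at Section 4.2, the three normalization steps can be sketched with scikit-learn's BayesianGaussianMixture standing in for the VGM. This is an illustrative sketch, not the authors' implementation; the two-mode toy column and the cap of 10 candidate modes are assumptions.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

# Step 1: fit a VGM to the column; unused components get near-zero weight.
rng = np.random.default_rng(0)
column = np.concatenate([rng.normal(-4, 0.5, 500), rng.normal(3, 1.0, 500)])  # toy two-mode column
vgm = BayesianGaussianMixture(
    n_components=10,                                   # upper bound on modes (assumption)
    weight_concentration_prior_type="dirichlet_process",
    weight_concentration_prior=1e-3,
    random_state=0,
)
vgm.fit(column.reshape(-1, 1))

def encode(c):
    # Step 2: rho_k proportional to mu_k * N(c; eta_k, phi_k), normalized over modes k.
    rho = vgm.predict_proba(np.array([[c]]))[0]
    # Step 3: sample one mode from rho, then normalize the value within that mode.
    k = rng.choice(len(rho), p=rho)
    eta = vgm.means_[k, 0]
    phi = np.sqrt(vgm.covariances_[k, 0, 0])
    alpha = (c - eta) / (4 * phi)                      # scalar value within the mode
    beta = np.eye(len(rho))[k]                         # one-hot mode indicator
    return alpha, beta

alpha, beta = encode(column[0])
```

The Dirichlet-process weight prior drives the weights of unused components toward zero, which is one way the effective number of modes mi can be estimated.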
Let k∗ be the value from the i∗th discrete column Di∗ that has to be matched by the generated samples r̂. Then the generator can be interpreted as the conditional distribution of rows given that particular value at that particular column, i.e. r̂ ∼ PG(row|Di∗ = k∗). For this reason, in this paper we name it the conditional generator, and a GAN built upon it is referred to as a Conditional GAN.
Integrating a conditional generator into the architecture of a GAN requires dealing with the following issues: 1) it is necessary to devise a representation for the condition as well as to prepare an input for it, 2) it is necessary for the generated rows to preserve the condition as it is given, and 3) it is necessary for the conditional generator to learn the real data conditional distribution, i.e. PG(row|Di∗ = k∗) = P(row|Di∗ = k∗), so that we can reconstruct the original distribution as

P(row) = Σ_{k∈Di∗} PG(row|Di∗ = k) P(Di∗ = k).

We present a solution that consists of three key elements, namely: the conditional vector, the generator loss, and the training-by-sampling method.

4

Figure 2: CTGAN model. The conditional generator can generate synthetic rows conditioned on one of the discrete columns. With training-by-sampling, the cond vector and training data are sampled according to the log-frequency of each category, thus CTGAN can evenly explore all possible discrete values.

Conditional vector. We introduce the vector cond as the way of indicating the condition (Di∗ = k∗). Recall that all the discrete columns D1, . . . , DNd end up as one-hot vectors d1, . . . , dNd, such that the ith one-hot vector is di = [d_i^(k)], for k = 1, . . . , |Di|. Let mi = [m_i^(k)], for k = 1, . . . , |Di|, be the ith mask vector associated to the ith one-hot vector di. Hence, the condition can be expressed in terms of these mask vectors as

m_i^(k) = 1 if i = i∗ and k = k∗, and 0 otherwise.

Then, define the vector cond as cond = m1 ⊕ . . . ⊕ mNd. For instance, for two discrete columns, D1 = {1, 2, 3} and D2 = {1, 2}, the condition (D2 = 1) is expressed by the mask vectors m1 = [0, 0, 0] and m2 = [1, 0]; so cond = [0, 0, 0, 1, 0].
Generator loss. During training, the conditional generator is free to produce any set of one-hot discrete vectors {d̂1, . . . , d̂Nd}. In particular, given the condition (Di∗ = k∗) in the form of the cond vector, nothing in the feed-forward pass prevents it from producing either d̂_i∗^(k∗) = 0 or d̂_i∗^(k) = 1 for k ≠ k∗. The mechanism proposed to enforce the conditional generator to produce d̂i∗ = mi∗ is to penalize its loss by adding the cross-entropy between mi∗ and d̂i∗, averaged over all the instances of the batch. Thus, as the training advances, the generator learns to make an exact copy of the given mi∗ into d̂i∗.
Training-by-sampling. The output produced by the conditional generator must be assessed by the critic, which estimates the distance between the learned conditional distribution PG(row|cond) and the conditional distribution on real data P(row|cond). The sampling of real training data and the construction of the cond vector should comply with each other to help the critic estimate this distance. Properly sampling the cond vector and training data can help the model evenly explore all possible values in discrete columns. For our purposes, we propose the following steps:

1. Create Nd zero-filled mask vectors mi = [m_i^(k)], k = 1 . . . |Di|, for i = 1, . . . , Nd, so the ith mask vector corresponds to the ith column, and each component is associated to a category of that column.
2. Randomly select a discrete column Di out of all the Nd discrete columns, with equal probability. Let i∗ be the index of the column selected. For instance, in Figure 2, the selected column was D2, so i∗ = 2.
3. Construct a PMF across the range of values of the column selected in 2, Di∗, such that the probability mass of each value is the logarithm of its frequency in that column.
4. Let k∗ be a randomly selected value according to the PMF above. For instance, in Figure 2, the range D2 has two values and the first one was selected, so k∗ = 1.
5. Set the k∗th component of the i∗th mask to one, i.e. m_i∗^(k∗) = 1.
6. Calculate the vector cond = m1 ⊕ · · · ⊕ mi∗ ⊕ · · · ⊕ mNd. For instance, in Figure 2, we have the masks m1 = [0, 0, 0] and m2 = [1, 0], so cond = [0, 0, 0, 1, 0].

5

4.4 Network Structure

Since columns in a row do not have local structure, we use fully-connected networks in the generator and critic to capture all possible correlations between columns. Specifically, we use two fully-connected hidden layers in both the generator and the critic. In the generator, we use batch normalization and the ReLU activation function. After the two hidden layers, the synthetic row representation is generated using a mix of activation functions: the scalar values αi are generated by tanh, while the mode indicator βi and the discrete values di are generated by Gumbel softmax. 
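Steps 1–6 of training-by-sampling above can be sketched as follows. This is a minimal NumPy illustration; the integer-coded toy columns are hypothetical, and adding 1 inside the logarithm to guard against empty categories is our assumption, not a detail from the paper.

```python
import numpy as np

def make_cond_sampler(discrete_columns, rng):
    """discrete_columns: list of 1-D integer-coded training columns."""
    pmfs = []
    for col in discrete_columns:
        freq = np.bincount(col).astype(float)
        mass = np.log(freq + 1.0)          # log-frequency mass (+1 is an assumption)
        pmfs.append(mass / mass.sum())
    sizes = [len(p) for p in pmfs]
    offsets = np.concatenate([[0], np.cumsum(sizes)])

    def sample_cond():
        i_star = rng.integers(len(sizes))                    # step 2: column chosen uniformly
        k_star = rng.choice(sizes[i_star], p=pmfs[i_star])   # steps 3-4: log-frequency PMF
        cond = np.zeros(offsets[-1])                         # step 1: zero-filled masks
        cond[offsets[i_star] + k_star] = 1.0                 # steps 5-6: set bit, concatenate
        return i_star, k_star, cond

    return sample_cond

rng = np.random.default_rng(0)
# hypothetical data: D1 has 3 categories, D2 has 2 (heavily imbalanced)
d1 = rng.integers(0, 3, size=1000)
d2 = (rng.random(1000) < 0.05).astype(int)
sampler = make_cond_sampler([d1, d2], rng)
i_star, k_star, cond = sampler()
```

The logarithm flattens the category frequencies, so rare categories are conditioned on far more often than their raw frequency would allow, while a matching real row with Di∗ = k∗ is still drawn from the true data for the critic.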
In the critic, we use the leaky ReLU activation function and dropout on each hidden layer.
Finally, the conditional generator G(z, cond) can be formally described as

h0 = z ⊕ cond
h1 = h0 ⊕ ReLU(BN(FC|cond|+|z|→256(h0)))
h2 = h1 ⊕ ReLU(BN(FC|cond|+|z|+256→256(h1)))
α̂i = tanh(FC|cond|+|z|+512→1(h2))          1 ≤ i ≤ Nc
β̂i = gumbel0.2(FC|cond|+|z|+512→mi(h2))    1 ≤ i ≤ Nc
d̂i = gumbel0.2(FC|cond|+|z|+512→|Di|(h2))  1 ≤ i ≤ Nd

We use the PacGAN [17] framework with 10 samples in each pac to prevent mode collapse. The architecture of the critic (with pac size 10) C(r1, . . . , r10, cond1, . . . , cond10) can be formally described as

h0 = r1 ⊕ . . . ⊕ r10 ⊕ cond1 ⊕ . . . ⊕ cond10
h1 = drop(leaky0.2(FC10|r|+10|cond|→256(h0)))
h2 = drop(leaky0.2(FC256→256(h1)))
C(·) = FC256→1(h2)

We train the model using the WGAN loss with gradient penalty [11]. We use the Adam optimizer with learning rate 2·10−4.

4.5 TVAE Model

The variational autoencoder is another neural network generative model. We adapt VAEs to tabular data by using the same preprocessing and modifying the loss function. We call this model TVAE. In TVAE, we use two neural networks to model pθ(rj|zj) and qφ(zj|rj), and train them using the evidence lower-bound (ELBO) loss [15].
The network pθ(rj|zj) needs to be designed differently so that the probability can be modeled accurately. In our design, the neural network outputs a joint distribution of 2Nc + Nd variables, corresponding to the 2Nc + Nd entries of rj. We assume αi,j follows a Gaussian distribution with different means and variances. All βi,j and di,j follow a categorical PMF. 
Here is our design.

h1 = ReLU(FC128→128(zj))
h2 = ReLU(FC128→128(h1))
ᾱi,j = tanh(FC128→1(h2))          1 ≤ i ≤ Nc
α̂i,j ∼ N(ᾱi,j, δi)               1 ≤ i ≤ Nc
β̂i,j ∼ softmax(FC128→mi(h2))     1 ≤ i ≤ Nc
d̂i,j ∼ softmax(FC128→|Di|(h2))   1 ≤ i ≤ Nd

pθ(rj|zj) = ∏_{i=1}^{Nc} P(α̂i,j = αi,j) ∏_{i=1}^{Nc} P(β̂i,j = βi,j) ∏_{i=1}^{Nd} P(d̂i,j = di,j)

Here α̂i,j, β̂i,j, d̂i,j are random variables, and pθ(rj|zj) is the joint distribution of these variables. In pθ(rj|zj), the weight matrices and δi are parameters in the network. These parameters are trained using gradient descent.
The modeling of qφ(zj|rj) is similar to that of a conventional VAE.

h1 = ReLU(FC|rj|→128(rj))
h2 = ReLU(FC128→128(h1))
µ = FC128→128(h2)
σ = exp((1/2) FC128→128(h2))
qφ(zj|rj) ∼ N(µ, σI)

6

TVAE is trained using Adam with learning rate 1e-3.

5 Benchmarking Synthetic Data Generation Algorithms

There are multiple deep learning methods for modeling tabular data. We noticed that all methods and their corresponding papers neither employed the same datasets nor were evaluated under similar metrics. This fact made comparison challenging and did not allow for identifying each method's weaknesses and strengths vis-à-vis the intrinsic challenges presented when modeling tabular data. 
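Returning to the TVAE decoder above, a NumPy forward pass makes it concrete under illustrative assumptions: one continuous column with three modes, one discrete column with two categories, random untrained weights, and a fixed δ (in TVAE, δi is a trained parameter).

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda x: np.maximum(x, 0.0)
softmax = lambda x: np.exp(x - x.max()) / np.exp(x - x.max()).sum()

def fc(u, v):
    # untrained stand-in for a learned FC layer: (weight, bias)
    return rng.normal(0, 0.1, (v, u)), np.zeros(v)

# illustrative sizes: latent dim 128; Nc = 1 with m1 = 3 modes; Nd = 1 with |D1| = 2
W1, b1 = fc(128, 128); W2, b2 = fc(128, 128)
Wa, ba = fc(128, 1); Wb, bb = fc(128, 3); Wd, bd = fc(128, 2)
delta = 0.1   # per-column std; a trained parameter in TVAE (fixed here for the sketch)

def decode(z):
    h1 = relu(W1 @ z + b1)
    h2 = relu(W2 @ h1 + b2)
    alpha_bar = np.tanh(Wa @ h2 + ba)                  # mean of alpha
    alpha = rng.normal(alpha_bar, delta)               # alpha ~ N(alpha_bar, delta)
    beta = rng.multinomial(1, softmax(Wb @ h2 + bb))   # one-hot mode indicator
    d = rng.multinomial(1, softmax(Wd @ h2 + bd))      # one-hot discrete value
    return np.concatenate([alpha, beta, d])            # a synthetic row r_j

row = decode(rng.standard_normal(128))
```

Note that, unlike the CTGAN generator, the decoder samples each output from an explicit parametric distribution, which is what allows the ELBO likelihood term to be evaluated exactly.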
To address this, we developed a comprehensive benchmarking suite.

5.1 Baselines and Datasets

In our benchmarking suite, we have baselines that consist of Bayesian networks (CLBN [7], PrivBN [28]), and implementations of current deep learning approaches for synthetic data generation (MedGAN [6], VeeGAN [21], TableGAN [18]). We compare TVAE and CTGAN with these baselines.
Our benchmark contains 7 simulated datasets and 8 real datasets.
Simulated data: We handcrafted a data oracle S to represent a known joint distribution, then sampled Ttrain and Ttest from S. This oracle is either a Gaussian mixture model or a Bayesian network. We followed the procedures found in [21] to generate Grid and Ring Gaussian mixture oracles. We added a random offset to each mode in Grid and called it GridR. We picked 4 well-known Bayesian networks - alarm, child, asia, insurance3 - and constructed Bayesian network oracles.
Real datasets: We picked 6 commonly used machine learning datasets from the UCI machine learning repository [9], with features and label columns in tabular form - adult, census, covertype, intrusion and news. We picked credit from Kaggle. We also binarized the 28 × 28 MNIST [16] dataset and converted each sample to a 784-dimensional feature vector plus one label column to mimic high-dimensional binary data; we call this MNIST28. We resized the images to 12 × 12 and used the same process to generate a dataset we call MNIST12. All in all, there are 8 real datasets in our benchmarking suite.

5.2 Evaluation Metrics and Framework

Given that evaluation of generative models is not a straightforward process, where different metrics yield substantially diverse results [24], our benchmarking suite evaluates multiple metrics on multiple datasets. Simulated data come from a known probability distribution, and for them we can evaluate the generated synthetic data via the likelihood fitness metric. 
For real datasets, there is a machine learning task, and we evaluate the synthetic data generation method via machine learning efficacy. Figure 3 illustrates the evaluation framework.
Likelihood fitness metric: On simulated data, we take advantage of the simulated data oracle S to compute the likelihood fitness metric. We compute the likelihood of Tsyn on S as Lsyn. Lsyn prefers overfitted models. To overcome this issue, we use another metric, Ltest. We retrain the simulated data oracle S′ using Tsyn. S′ has the same structure but different parameters from S. If S is a Gaussian mixture model, we use the same number of Gaussian components and retrain the mean and covariance of each component. If S is a Bayesian network, we keep the same graphical structure and learn a new conditional distribution on each edge. Then Ltest is the likelihood of Ttest on S′. This metric overcomes the issue in Lsyn: it can detect mode collapse. But it introduces prior knowledge of the structure of S′, which is not necessarily encoded in Tsyn.
Machine learning efficacy: For a real dataset, we cannot compute the likelihood fitness; instead, we evaluate the performance of using synthetic data as training data for machine learning. We train prediction models on Tsyn and test the prediction models using Ttest. We evaluate the performance of classification tasks using accuracy and F1, and evaluate regression tasks using R2. For each dataset, we select classifiers or regressors that achieve reasonable performance. 
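The machine learning efficacy protocol can be sketched as follows, with toy stand-ins for Tsyn and Ttest and two stand-in prediction models; the actual benchmark uses the models and hyperparameters given in the supplementary material.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import f1_score

def ml_efficacy(T_syn, T_test, label_col):
    """Train on synthetic rows, test on real held-out rows; average F1 over models."""
    Xs, ys = np.delete(T_syn, label_col, axis=1), T_syn[:, label_col]
    Xt, yt = np.delete(T_test, label_col, axis=1), T_test[:, label_col]
    scores = []
    for model in (DecisionTreeClassifier(max_depth=10, random_state=0),
                  MLPClassifier(hidden_layer_sizes=(50,), max_iter=300, random_state=0)):
        model.fit(Xs, ys)
        scores.append(f1_score(yt, model.predict(Xt), average="macro"))
    return float(np.mean(scores))

# toy stand-ins for T_syn and T_test: two well-separated blobs with a binary label
rng = np.random.default_rng(0)
def blobs(n):
    y = rng.integers(0, 2, n)
    X = rng.normal(0, 1, (n, 2)) + 3 * y[:, None]
    return np.column_stack([X, y])

score = ml_efficacy(blobs(500), blobs(200), label_col=2)
```

The key point of the protocol is the asymmetry: the models never see real training rows, so a high score is evidence that Tsyn preserves the feature-label relationships of the real distribution.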
(Models and hyperparameters can be found in the supplementary material as well as in our benchmark framework.) Since we are not trying to pick the best classification or regression model, we take the average performance of multiple prediction models to evaluate our metric for G.

3The structure of Bayesian networks can be found at http://www.bnlearn.com/bnrepository/.

7

Figure 3: Evaluation framework on simulated data (left) and real data (right).

Table 2: Benchmark results over three sets of experiments, namely Gaussian mixture simulated data (GM Sim.), Bayesian network simulated data (BN Sim.), and real data. For GM Sim. and BN Sim., we report the average of each metric. For real datasets, we report average F1 for classification tasks and R2 for regression tasks respectively.

Method     GM Sim. Lsyn   GM Sim. Ltest   BN Sim. Lsyn   BN Sim. Ltest   Real clf   Real reg
Identity   -2.61          -2.61           -9.33          -9.36           0.743      0.14
CLBN       -3.06          -7.31           -10.66         -9.92           0.382      -6.28
PrivBN     -3.38          -12.42          -12.97         -10.90          0.225      -4.49
MedGAN     -7.27          -60.03          -11.14         -12.15          0.137      -8.80
VEEGAN     -10.06         -4.22           -15.40         -13.86          0.143      -6.5e6
TableGAN   -8.24          -4.12           -11.84         -10.47          0.162      -3.09
TVAE       -2.65          -5.42           -6.76          -9.59           0.519      -0.20
CTGAN      -5.72          -3.40           -11.67         -10.60          0.469      -0.43

5.3 Benchmarking Results

We evaluated CLBN, PrivBN, MedGAN, VeeGAN, TableGAN, CTGAN, and TVAE using our benchmark framework. We trained each model with a batch size of 500. Each model is trained for 300 epochs. Each epoch contains N/batch_size steps, where N is the number of rows in the training set. We posit that for any dataset, across any metric except Lsyn, the best performance is achieved by Ttrain. Thus we present the Identity method, which outputs Ttrain.
We summarize the benchmark results in Table 2. The full results table can be found in the Supplementary Material. 
For simulated data from Gaussian mixtures, CLBN and PrivBN suffer because continuous numeric data has to be discretized before modeling using Bayesian networks. MedGAN, VeeGAN, and TableGAN all suffer from mode collapse. With mode-specific normalization, our model performs well on these 2-dimensional continuous datasets.
On simulated data from Bayesian networks, CLBN and PrivBN have a natural advantage. Our CTGAN achieves slightly better performance than MedGAN and TableGAN. Surprisingly, TableGAN works well on these datasets, despite considering discrete columns as continuous values. One possible reason for this is that in our simulated data, most variables have fewer than 4 categories, so the conversion does not cause serious problems.
On real datasets, TVAE and CTGAN outperform CLBN and PrivBN, whereas the other GAN models cannot achieve as good a result as the Bayesian networks. On large-scale real datasets, learning a high-quality Bayesian network is difficult, so models trained on CLBN and PrivBN synthetic data are 36.1% and 51.8% worse than models trained on real data.
TVAE outperforms CTGAN in several cases, but GANs do have several favorable attributes, and this does not indicate that we should always use VAEs rather than GANs to model tables. 
The generator in GANs does not have access to real data during the entire training process; thus, we can make CTGAN achieve differential privacy [14] more easily than TVAE.

8

5.4 Ablation Study

We did an ablation study to understand the usefulness of each of the components in our model. Table 3 shows the results from the ablation study.
Mode-specific normalization. In CTGAN, we use a variational Gaussian mixture model (VGM) to normalize continuous columns. We compare it with (1) GMM5: a Gaussian mixture model with 5 modes, (2) GMM10: a Gaussian mixture model with 10 modes, and (3) MinMax: min-max normalization to [−1, 1]. Using GMM slightly decreases the performance, while min-max normalization gives the worst performance.
Conditional generator and training-by-sampling. We successively remove these two components. (1) w/o S.: we first disable training-by-sampling; the generator still gets a condition vector and its loss function still has the cross-entropy term, but the condition vector is sampled from the training data frequency instead of the log frequency. (2) w/o C.: we further remove the condition vector in the generator. These ablation results show that both training-by-sampling and the conditional generator are critical for imbalanced datasets. Especially on a highly imbalanced dataset such as credit, removing training-by-sampling results in 0% on the F1 metric.
Network architecture. In the paper, we use WGANGP+PacGAN. Here we compare it with three alternatives: WGANGP only, vanilla GAN loss only, and vanilla GAN + PacGAN. 
We observe that
WGANGP is better suited to the synthetic data task than the vanilla GAN loss, while PacGAN is helpful for the
vanilla GAN loss but not as important for WGANGP.

Table 3: Ablation study results on the mode-specific normalization, the conditional generator and training-
by-sampling module, and the network architecture. The absolute performance change on real
classification datasets (excluding MNIST) is reported.

              Mode-specific Normalization      Generator          Network Architecture
Model          GMM5     GMM10    MinMax        w/o S.   w/o C.    WGANGP   GAN      GAN+PacGAN
Performance    -4.1%    -8.6%    -25.7%        -17.8%   -36.5%    +1.75%   -6.5%    -5.2%

6 Conclusion

In this paper we attempt to find a flexible and robust model to learn the distribution of columns
with complicated distributions. We observe that none of the existing deep generative models can
outperform Bayesian networks which discretize continuous values and learn greedily. We show
several properties that make this task unique and propose our CTGAN model. Empirically, we show
that our model can learn better distributions than Bayesian networks. Mode-specific normalization
can convert continuous values of arbitrary range and distribution into a bounded vector representation
suitable for neural networks, and our conditional generator and training-by-sampling can overcome
the imbalanced training data issue. Furthermore, we argue that the conditional generator can help
generate data with a specific discrete value, which can be used for data augmentation. As future
work, we would like to derive a theoretical justification for why GANs can work on a distribution with both
discrete and continuous data.

Acknowledgements

This paper is partially supported by National Science Foundation Grant ACI-1443068. We
(the authors from MIT) also acknowledge the generous support provided by Accenture for the synthetic
data generation project. Dr.
Cuesta-Infante is funded by Spanish Government research grants
RTI2018-098743-B-I00 (MICINN/FEDER) and Y2018/EMT-5062 (Comunidad de Madrid).

References

[1] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial
networks. In International Conference on Machine Learning, 2017.

[2] Laura Aviñó, Matteo Ruffini, and Ricard Gavaldà. Generating synthetic but plausible healthcare
record datasets. In KDD Workshop on Machine Learning for Medicine and Healthcare, 2018.

[3] Christopher M Bishop. Pattern Recognition and Machine Learning. Springer, 2006.

[4] Ramiro Camino, Christian Hammerschmidt, and Radu State. Generating multi-categorical
samples with generative adversarial networks. In ICML Workshop on Theoretical Foundations
and Applications of Deep Generative Models, 2018.

[5] Zhengping Che, Yu Cheng, Shuangfei Zhai, Zhaonan Sun, and Yan Liu. Boosting deep
learning risk prediction with generative adversarial networks for electronic health records. In
International Conference on Data Mining. IEEE, 2017.

[6] Edward Choi, Siddharth Biswal, Bradley Malin, Jon Duke, Walter F. Stewart, and Jimeng
Sun. Generating multi-label discrete patient records using generative adversarial networks. In
Machine Learning for Healthcare Conference. PMLR, 2017.

[7] C Chow and Cong Liu. Approximating discrete probability distributions with dependence trees.
IEEE Transactions on Information Theory, 14(3):462-467, 1968.

[8] Graham Cormode, Cecilia Procopiuc, Divesh Srivastava, Entong Shen, and Ting Yu. Differen-
tially private spatial decompositions. In International Conference on Data Engineering. IEEE,
2012.

[9] Dheeru Dua and Casey Graff. UCI machine learning repository, 2017. URL http://archive.
ics.uci.edu/ml.

[10] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil
Ozair, Aaron C. Courville, and Yoshua Bengio.
Generative adversarial nets. In Advances in
Neural Information Processing Systems, 2014.

[11] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville.
Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems,
2017.

[12] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training
by reducing internal covariate shift. In International Conference on Machine Learning, 2015.

[13] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-Softmax.
In International Conference on Learning Representations, 2016.

[14] James Jordon, Jinsung Yoon, and Mihaela van der Schaar. PATE-GAN: Generating synthetic data
with differential privacy guarantees. In International Conference on Learning Representations,
2019.

[15] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. In International
Conference on Learning Representations, 2013.

[16] Yann LeCun and Corinna Cortes. MNIST handwritten digit database, 2010. URL http:
//yann.lecun.com/exdb/mnist/.

[17] Zinan Lin, Ashish Khetan, Giulia Fanti, and Sewoong Oh. PacGAN: The power of two samples
in generative adversarial networks. In Advances in Neural Information Processing Systems,
2018.

[18] Noseong Park, Mahmoud Mohammadi, Kshitij Gorde, Sushil Jajodia, Hongkyu Park, and
Youngmin Kim. Data synthesis based on generative adversarial networks. In International
Conference on Very Large Data Bases, 2018.

[19] Neha Patki, Roy Wedge, and Kalyan Veeramachaneni. The synthetic data vault. In International
Conference on Data Science and Advanced Analytics. IEEE, 2016.

[20] Jerome P Reiter. Using CART to generate partially synthetic public use microdata.
Journal of
Official Statistics, 21(3):441, 2005.

[21] Akash Srivastava, Lazar Valkov, Chris Russell, Michael U Gutmann, and Charles Sutton.
VEEGAN: Reducing mode collapse in GANs using implicit variational learning. In Advances in
Neural Information Processing Systems, 2017.

[22] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov.
Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine
Learning Research, 15(1):1929-1958, 2014.

[23] Yi Sun, Alfredo Cuesta-Infante, and Kalyan Veeramachaneni. Learning vine copula models for
synthetic data generation. In AAAI Conference on Artificial Intelligence, 2018.

[24] Lucas Theis, Aäron van den Oord, and Matthias Bethge. A note on the evaluation of generative
models. In International Conference on Learning Representations, 2016.

[25] Alexandre Yahi, Rami Vanguri, Noémie Elhadad, and Nicholas P Tatonetti. Generative adversar-
ial networks for electronic health records: A framework for exploring and evaluating methods
for predicting drug-induced laboratory test trajectories. In NIPS Workshop on Machine Learning
for Health Care, 2017.

[26] Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. SeqGAN: Sequence generative adversarial
nets with policy gradient. In AAAI Conference on Artificial Intelligence, 2017.

[27] Jun Zhang, Xiaokui Xiao, and Xing Xie. PrivTree: A differentially private algorithm for
hierarchical decompositions. In International Conference on Management of Data. ACM, 2016.

[28] Jun Zhang, Graham Cormode, Cecilia M Procopiuc, Divesh Srivastava, and Xiaokui Xiao.
PrivBayes: Private data release via Bayesian networks. ACM Transactions on Database Systems,
42(4):25, 2017.

[29] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image
translation using cycle-consistent adversarial networks.
In International Conference on Computer
Vision, pages 2223-2232. IEEE, 2017.