{"title": "Bias and Generalization in Deep Generative Models: An Empirical Study", "book": "Advances in Neural Information Processing Systems", "page_first": 10792, "page_last": 10801, "abstract": "In high dimensional settings, density estimation algorithms rely crucially on their inductive bias. Despite recent empirical success, the inductive bias of deep generative models is not well understood. In this paper we propose a framework to systematically investigate bias and generalization in deep generative models of images by probing the learning algorithm with carefully designed training datasets. By measuring properties of the learned distribution, we are able to find interesting patterns of generalization. We verify that these patterns are consistent across datasets, common models and architectures.", "full_text": "Bias and Generalization in Deep Generative Models:\n\nAn Empirical Study\n\nShengjia Zhao\u2020, Hongyu Ren\u2020, Arianna Yuan, Jiaming Song, Noah Goodman, Stefano Ermon\n\n{sjzhao,hyren,xfyuan,tsong,ngoodman,ermon}@stanford.edu\n\nStanford University\n\nAbstract\n\nIn high dimensional settings, density estimation algorithms rely crucially on their\ninductive bias. Despite recent empirical success, the inductive bias of deep gen-\nerative models is not well understood. In this paper we propose a framework to\nsystematically investigate bias and generalization in deep generative models of\nimages. 
Inspired by experimental methods from cognitive psychology, we probe\neach learning algorithm with carefully designed training datasets to characterize\nwhen and how existing models generate novel attributes and their combinations.\nWe identify similarities to human psychology and verify that these patterns are\nconsistent across commonly used models and architectures.\n\n1\n\nIntroduction\n\nThe goal of a density estimation algorithm is to learn a distribution from training data (Figure 1,A).\nHowever, unbiased and consistent density estimation is known to be impossible [1, 2]. Even in\ndiscrete settings, the number of possible distributions scales doubly exponentially w.r.t. dimensional-\nity [3], suggesting extremely high data requirements. As a result, the assumptions made by a learning\nalgorithm, or its inductive bias, are key when practical data regimes are concerned. For simple\ndensity estimation algorithms, such as \ufb01tting a Gaussian distribution via maximum likelihood, we can\neasily characterize the distribution that is produced given some training data. However, for complex\nalgorithms involving deep generative models such as Generative Adversarial Networks (GAN) and\nvariational autoencoders (VAE) [4\u20138], the nature of the inductive bias is very dif\ufb01cult to characterize.\nIn the absence of insights in analytic form, a possible strategy to evaluate this bias is to probe the\ninput-output behavior of the learning algorithm. The challenge with this approach is that both inputs\nand outputs are high dimensional (e.g., distributions over images), making it dif\ufb01cult to exhaustively\ncharacterize the input-output relationship. A strategy for studying high-dimensional objects is to\nproject them onto a lower dimensional space where analysis is feasible. In fact, similar problems\nhave long challenged cognitive psychologists. 
As visual cognitive functions are extremely complex,\ncognitive psychologists and neuroscientists have developed controlled experiments to investigate the\nvisual system. For example, experiments on perception and representation of shape, color, numerosity,\netc., have led to important discoveries such as ensemble representation [9], prototype enhancement\neffect [10], and Weber\u2019s law [11].\nWe propose to adopt experimental methods from cognitive psychology to characterize the generaliza-\ntion biases of machine intelligence. To characterize the input-output relationship of an algorithm, we\nexplore its behavior by projecting the image space onto a carefully chosen low dimensional feature\nspace. We select several features that are known to be important to humans, such as shape, color, size,\nnumerosity, etc. We systematically explore these dimensions by crafting suitable training datasets and\nmeasuring corresponding properties of the learned distribution. For example, we ask, after training\non a dataset with red and yellow spheres, and red cubes, will the model generate yellow cubes, as a\nresult of its inductive bias?\n\n\u2020 co-\ufb01rst authorship.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fFigure 1: A: A deep generative model can be thought of as a black box that we probe with carefully\ndesigned training data. B: First we examine the learned distribution when training data takes a single\nvalue for some feature (e.g., all training images have 3 objects). C: Next we study input distributions\nwith multiple modes for some feature (e.g., all training images have 2, 4 or 10 objects). D: We\nexplore the behavior of the model when the training data has multiple modes for multiple features.\n\nUsing this framework, we are able to systematically evaluate generalization patterns of state-of-the-\nart models such as GAN [5] and VAE [4]. 
Surprisingly, we found these patterns to be consistent\nacross datasets, models, and hyper-parameter choices. In addition, some of these patterns have\nstriking similarities with previously reported experiments in cognitive psychology. For example,\nwhen presented with a training set where all images contain exactly 3 objects, both GANs and VAEs\ntypically generate 2-5 objects, with a log-normal shaped distribution (Figure 1,B). If the training set\ncontains multiple modes (e.g., all images contain either 2 or 10 objects) we observe a behavior similar\nto that of a linear \ufb01lter: the algorithm acts as if it is trained separately on 2 and 10 objects and\nthen averages the two distributions. An exception is when the modes are close to each other (e.g., 2\nand 4 objects) where we observe the prototype enhancement effect [10]: the learned distribution assigns\nhigher probability to the mean number of objects (3 in our example), even though no image with 3\nobjects was present in the training set (Figure 1,C). Finally, for multiple features, when the training\nset only contains certain combinations, e.g., red cubes but not yellow cubes (Figure 1,D), we \ufb01nd that\nthe learning algorithms will memorize the combinations in the training set when it contains a small\nnumber of them (e.g., 20), and will generate new combinations (not in the training set) when there\nis more variety (e.g., 80). We study the number of novel combinations generated as a function of\nthe number of combinations in the training set. For all of the observations we \ufb01nd consistent results\nacross a diverse set of tasks (CLEVR, colored dots, MNIST), training objectives (GAN or VAE),\narchitectures (convolutional or fully connected layers), and hyper-parameters.\n\n2 Density Estimation, Bias, and Generalization\nLet X be the input space (e.g., images), and let D = {x1,\u00b7\u00b7\u00b7 , xn} be a training dataset sampled i.i.d.\nfrom an underlying distribution p(x) on X . 
The goal of a density estimation algorithm A is to take D\nas input and produce a distribution q(x) over X that is \u201cclose\u201d to p(x). Crucially, the same algorithm\nA should work (well) on a range of input datasets, sampled from different distributions p(x).\nHowever, estimating p(x) is dif\ufb01cult [1]. In fact, even the simpli\ufb01ed task of estimating the support of\np(x) is challenging in high dimensions. For example, natural images vary along a large number of\naxes, e.g., the number of objects present, their type, color, shape, position, etc. Because the number of\npossible combinations of these attributes or features grows exponentially (w.r.t the number of possible\nfeatures), the size of D is often exponentially small compared to the support of p(x) in this feature\nspace. Therefore strong prior assumptions must be made to generalize from this very small D to the\nexponentially larger support set of p(x). We will refer to the process of producing q(x) from D as\ngeneralization, and any assumptions used when producing q(x) from D as inductive bias [12].\nDeep generative modeling algorithms implicitly use many types of inductive biases. For example, they\noften involve models parameterized with (convolutional) neural networks, trained using adversarial or\nvariational methods. In addition, the training objective is typically optimized by stochastic gradient\ndescent, contributing to the inductive bias [13, 14]. The resulting effect of these combined factors is\ndif\ufb01cult to study theoretically. As a result, empirical analysis has become the primary approach. For\nexample, it has been shown that on multiple natural image datasets, the learned distribution produces\nnovel examples that generalize in meaningful ways, going beyond pixel space interpolation [15\u201317].\nHowever these studies are not systematic. 
They do not answer questions such as how the learning algorithm will generalize given a new dataset, or provide insight into exactly which inductive biases\nare involved. The lack of systematic study is due to the high dimensionality of both the input dataset\nD and the output distribution q(x). In fact, even evaluating how \u201cclose\u201d the learned distribution q is\nto p is an open question, and there is no commonly accepted evaluation metric [18\u201320]. Therefore, to\nexamine the inductive bias we need to design settings where the training and output distributions can\nbe exactly characterized and compared.\n\n3 Exploring Generalization Through Probing Features\nWe take inspiration from cognitive psychology, and provide a novel framework to analyze empirically\nthe inductive bias of generative algorithms via a set of probing features. We focus on images but the\ntechniques can also be applied to other domains.\nLet S \u2282 X be the support set of p(x). We de\ufb01ne a set of (probing) features as a tuple of functions\n\u03c6 = (\u03c61,\u00b7\u00b7\u00b7 , \u03c6k) where each \u03c6i maps an input image in S to a value. For example, one of the\nfeatures \u03c6i : S \u2192 N may map the input image to the number of objects (numerosity) in that image\n(Figure 1BC). We denote the range of \u03c6 as the feature space Z. 
For any choice of p(x), with a\nslight abuse of notation, we denote p(z) as the (induced) distribution on Z by \u03c6(x) when x \u223c p(x).\nIntuitively p(z) is the projection of p(x) onto the feature space Z.\nWhen a learning algorithm A produces a learned distribution q(x), we also project it to feature space\nusing \u03c6. Our goal is to investigate how p(z) differs from q(z), i.e. the generalization behavior of\nthe learning algorithm restricted to the feature space Z. In the input space X even evaluating the\ndistance between p(x) and q(x) is dif\ufb01cult, while in feature space Z we can not only decide if q(z) is\ndifferent from p(z) but also characterize how they are different. For example, if p(z) is a distribution\nover images with red and blue triangles ( , ) and red circles ( ), we can investigate whether q(z)\ngeneralizes to blue circles ( ). We can also investigate the number of colors for circles that must\nbe in the training data so that q(z) generates circles of all colors. Such questions are important to\ncharacterize the inductive bias of existing generative modeling algorithms.\nRelated ideas [21] have been previously used to evaluate the distance between p(x) and q(x).\nIn particular, the FID score [22], the mode score [23] and the Inception score [24] use hidden\nfeatures/labels of a pretrained CNN classi\ufb01er as \u03c6, and measure the performance of generative\nmodeling algorithms by comparing p(z) and q(z) under this projection. In contrast, because we\nwant to study the exact difference between p(z) and q(z), we choose \u03c6 to be interpretable high level\nfeatures inspired by experimental work in cognitive psychology, e.g. numerosity, color, etc.\nUsing low dimensional projection function \u03c6 has an additional bene\ufb01t. Because Z is low dimensional\nand discrete in our synthetic datasets, we are essentially in the in\ufb01nite data regime. 
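The projection methodology above can be made concrete with a short sketch. This is a minimal illustration under assumed toy inputs (small arrays standing in for images, and a hypothetical numerosity feature), not the evaluation pipeline used in the paper:

```python
import numpy as np

# Project samples from the training distribution p and the learned
# distribution q through a feature map phi, then compare the induced
# distributions p(z) and q(z) directly in the low dimensional feature space.

def phi_numerosity(image):
    # Toy probing feature: the number of 'objects' (nonzero pixels).
    return int(np.count_nonzero(image))

def induced_distribution(samples, phi, support):
    # Empirical distribution over the feature space Z induced by phi.
    counts = {z: 0 for z in support}
    for x in samples:
        counts[phi(x)] += 1
    n = len(samples)
    return {z: c / n for z, c in counts.items()}

def total_variation(p, q):
    # One way to quantify exactly how the induced distributions differ.
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(z, 0.0) - q.get(z, 0.0)) for z in keys)
```

Because Z is low dimensional and discrete, p(z) and q(z) can be compared exactly, which is what makes the difference between the two distributions characterizable at all.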
In all of our\nexperiments, the support of p(z) does not exceed 500, so we accurately approximate p(z) [25] with a\nreasonably sized dataset (100k-1M examples in our experiments). The interesting observation is that\neven though D is a very accurate approximation of p(z), the learned distribution q(z) is not, so this\nsimpli\ufb01ed setting is suf\ufb01cient to reveal many interesting inductive biases of the modeling algorithms.\nFeature Selection and Evaluation We select features \u03c6 that satisfy two requirements: 1) they\nare important to human perception and have been studied in cognitive psychology, and 2) they are\neasy to evaluate either by reliable algorithms or human judgment. The features studied include\nnumerosity, shape, color, size, and location of each object. For numerosity and shape we use\nindependent evaluations by three human evaluators. The other features are easy to evaluate by\nautomated algorithms. More details about evaluation are presented in the appendix.\nModels To ensure that the result is not sensitive to the choice of model architecture and hyper-\nparameters, we use two very different model families: GAN (WGAN-GP [26]) and VAE [4]. We\nalso use different network architectures and hyper-parameter choices, including both convolutional\nnetworks and fully connected networks. We will present the experimental results for WGAN-GP\nwith convolutional networks in the main body, and results for other architectures in the appendix.\nSurprisingly, we \ufb01nd fairly consistent results for these very different models and objectives. Whenever\nthey differ, we will explicitly mention the differences in the main body.\n\n4 Characterizing Generalization on an Individual Feature\nIn this section we explore generalization when we project the input space X to a single feature (i.e.,\np(z) is a one-dimensional distribution). 
We \ufb01rst analyze the learning algorithm\u2019s output q(z) when the feature we manipulate contains only one value, i.e., p(z) is a delta function/unit impulse. We ask\nquestions such as: when all images in the training set depict \ufb01ve objects, how many objects will the\ngenerative model produce? One might expect that since the feature takes a single value, and we have\nhundreds of thousands of distinct examples, the learning algorithm would capture exactly this \ufb01xed\nfeature value. However, this is not true, indicating strong inductive bias.\nWe call the learned distribution q(z) when the input distribution has a single mode the impulse\nresponse of the modeling algorithm. We borrow this terminology from signal processing theory\nbecause we \ufb01nd the behavior similar to that of a linear \ufb01lter: if p(z) is supported on multiple\nvalues, the model\u2019s output q(z) is a convolution between p(z) and the model\u2019s impulse response. An\nexception is when two modes of p(z) are close together. In this case we \ufb01nd the prototype enhancement\neffect and the learning algorithm produces a distribution that \u201ccombines\u201d the two modes. Finally\nwe justify our approach of studying each feature individually by showing that the learning\nalgorithm\u2019s behavior on each feature is mostly independent of the other features we study.\n\n4.1 Generalization to a Single Mode\n4.1.1 Numerosity\nExperimental Settings We use two different datasets for this experiment: a toy dataset where\nthere are k non-overlapping dots (with random color and location) in the image, as in the numerosity\nestimation task in cognitive psychology [27, 28], and the CLEVR dataset where there are k objects\n(with random shape, color, location and size) in the scene [29]. More details about the datasets are\nprovided in the Appendix. 
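The linear-filter analogy introduced above admits a direct numerical sketch. The Gaussian impulse response below is an assumed placeholder (the measured responses are skewed, log-normal-like), and the discrete feature support is hypothetical:

```python
import numpy as np

# When the modes of the training distribution p(z) are far apart, the learned
# q(z) is well predicted by convolving p(z) with the model's impulse response
# (its output when trained on a single feature value).

def impulse_response(width, sigma=1.0):
    # Assumed unimodal response on an odd-length window centered at zero.
    z = np.arange(width) - width // 2
    r = np.exp(-0.5 * (z / sigma) ** 2)
    return r / r.sum()

def predict_q(p_z, response):
    # Predicted learned distribution: the training distribution convolved
    # with the impulse response, renormalized over the feature support.
    q = np.convolve(p_z, response, mode='same')
    return q / q.sum()
```

For two well-separated modes (say 2 and 10 objects), the prediction is simply two copies of the impulse response centered at the modes; prototype enhancement is precisely the regime where the actual q(z) is more concentrated than this convolution predicts.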
Example training and generated images are shown in Figure 2, left and\nright respectively.1\n\nFigure 2: Example training (left) and generated images (right) with annotated numerosity. In this\nexample, all training examples contain six dots (top) or two CLEVR objects (bottom), while the\ngenerated examples on the right often contain a different number of dots/objects.\n\nFigure 3: Left: Distribution over number of dots. The arrows are the number of dots the learning\nalgorithm is trained on, and the solid line is the distribution over the number of dots the model\ngenerates. Middle: Distribution over number of CLEVR objects the model generates. Generating\nCLEVR is harder so we explore small numerosities, but the generalization pattern is similar to dots.\nRight: Numerosity perception of monkeys [30]. Each solid line plots the likelihood a monkey judges\na stimulus to have the same numerosity as a reference stimulus. Figure adapted from [30].\n\nResults and Discussion As shown in Figure 2 and quantitatively evaluated in Figure 3, in both\ncolored dots and CLEVR experiments, the learned distribution does not produce the same number of\nobjects as in the dataset on which it was trained. The distribution of the numerosity in the generated\nimages is centered at the numerosity from the dataset, with a slight bias towards over-estimation. For\nexample, when trained on images with six dots (Figure 2, cyan curve) the generated images contain\nanywhere from four to nine dots.\nResearchers have found neurons that respond to numerosity in human and primate brains [27,\n28]. 
From both behavioral data and neural data, two salient properties about these neurons were documented [31]: 1) larger variance for larger numerosity, and 2) asymmetric response with more\nmoderate slopes for larger numerosities compared to smaller ones [31, 30] (Figure 3 right). It is\nremarkable that deep generative models generalize in similar ways w.r.t. the numerosity feature.\n\n1Code available at https://github.com/ermongroup/BiasAndGeneralization\n\n4.1.2 Color Proportion\nExperimental Settings For this feature we use the dataset shown in Figure 4. Each pie has several\nproperties: proportion of red color zred, size zsize, and location zloc. In these experiments we choose\nthe proportion of red to be 10%, 30%, 50%, 70%, 90% respectively, while the other features (size\nand location) are selected uniformly at random within the maximum range allowed in the dataset.\nDetails about the dataset can be found in the Appendix.\n\nFigure 4: Example images from the training set (Left) and generated by our model (Right). Each\ncircle has a few properties: proportion of the pie that takes color red, radius of the pie and location\n(of center of circle) on the x and y axis.\n\nResults and Discussion The results are shown in Figure 5(Left). We \ufb01nd that the learned feature\ndistribution q(z) is well approximated by a Gaussian centered at the value the model is trained on. What\nis notable is that the Gaussian is sharper for small (10%) and large proportions (90%). This is\nconsistent with Weber\u2019s law [11], which states that humans are in fact sensitive to relative change\n(ratio) rather than absolute change (difference) (e.g., the difference between 10% and 12% is more\nsalient compared to 40% to 42%). 
Unlike numerosity, generalization in this domain is symmetric.\n\nFigure 5: Generated samples for each feature (proportion of color (left), size (middle) and location\n(right)). The dashed arrows show the training distributions in feature space (delta distributions), while\nthe solid lines are the actual densities learned by the model. Both size and location are very sharp\ndistributions, while proportion of red color is a smoother Gaussian. Also note that the size distribution\n(middle) is skewed, similar to numerosity.\n\n4.1.3 Size and Location\nWe also use the pie dataset in Figure 4 to explore size and location. Similar to color, we \ufb01x all training\nimages to have either a given size, or a given location, while varying the other features randomly. The\nresults are shown in Figure 5(middle and right). Interestingly, we observe that the size distribution is\nskewed, and the model has a greater tendency to produce larger objects, while the location distribution is\nfairly symmetric.\n\n4.2 Convolution Effect and Prototype Enhancement Effect\nNow that we have probed the algorithm\u2019s behavior when p(z) is unimodal, we investigate its behavior\nwhen p(z) is multi-modal. We \ufb01nd that in feature space the output distribution can be very well\ncharacterized by convolving the input distribution with the learning algorithm\u2019s output on each\nindividual mode (impulse response), if the input modes are far from each other (similar to a linear\n\ufb01lter in signal processing). However we \ufb01nd that this no longer holds when the impulses are close to\neach other, where we observe that the model generates a unimodal and more concentrated distribution\nthan convolution would predict. We call this effect prototype enhancement in analogy with the\nprototype enhancement effect in cognitive psychology [32, 10].\n\nExperimental Settings For these experiments we use the color proportion feature of the pie dataset\nin Figure 4. 
We train the model with two bimodal distributions, one with 30% or 40% red (two close\nmodes), and the other with 30% or 90% red (two distant modes). We also explore several other\nchoices of feature/modes in the appendix, and they show essentially identical patterns.\n\nResults and Discussion The results are illustrated in Figure 6. When the two modes of the training distribution are\nsuf\ufb01ciently close (top row), the modes \u201csnap\u201d together and the mean of the two modes is assigned\nhigh probability. That is, objects with 35% red are the most likely to be generated, even though they\nnever appeared in the training set. When the modes are far from each other, convolution predicts\nthe model\u2019s behavior very well. Again, these results are consistent for GAN/VAE and different\narchitectures/hyper-parameters (Appendix A).\n\nFigure 6: Illustration of the convolution and prototype enhancement effects. Left: The training\ndistribution p(z) consists of either two modes close together (top) or far from each other (bottom).\nMiddle: The predicted response by convolving the unimodal response with the training distribution.\nRight: The actual q(z) the learning algorithm produces. When the two modes are far from each\nother, q(z) is very accurately modeled by convolution, while for modes close together, the resulting\ndistribution is more concentrated around the mean.\nSimilar observations have also been made in psychological experiments. 
For example, for a set of\nsimilar examples the participant is more likely to identify the \u201caverage\u201d (which they haven\u2019t seen) as\nbelonging to the set of examples compared to actual training examples (which they have seen) [10].\n\n4.3 Independence of Features\nIn this section we show that each of the features we consider can be analyzed independently of the\nothers. We \ufb01nd that the generalization behavior along a particular feature dimension is fairly stable\nas we modify the distribution in other dimensions. As a result, we can decompose the analysis\nacross dimensions. In fact, Section 4.1.1 already presented some evidence of independence where we\nshowed that the learned distribution on numerosity is similar for both dots and CLEVR.\n\nExperimental Setting For these experiments we use the pie dataset in Figure 4. We study all\nthree features: proportion of red color zred, size zsize, and location zloc, and show that the learning\nalgorithm\u2019s response on each is independent of the other features. For each feature, we select three\n\ufb01xed values (0.3, 0.4, 0.9 for proportion of red color, 0.5, 0.55, 0.8 for size, and -0.05, 0.0, 0.2 for\nlocation). For each \ufb01xed value of the feature under study, the other features can take 1-50 random\nvalues. For example, when studying if generalization on proportion of red color zred is independent\nof other features, the training distribution p(zred, zsize, zloc) is chosen such that the marginal\np(zred) is uniform on {0.3, 0.4, 0.9}. If zred is independent of the other features, the learned\ndistribution q(zred) should only depend on this marginal p(zred) but not p(zsize, zloc|zred). To\nexplore different options for p(zsize, zloc|zred) we select 1-50 random values as the support of this\nconditional distribution. 
This covers a very wide range of interactions between zred and the other two\nfeatures from strongly correlated (1 value) to very weakly correlated (50 values).\n\nResults and Discussion The learned distribution for each feature as the other features vary is\nshown in Figure 7. We \ufb01nd that the learned distribution for each feature is fairly independent of the\nother dimensions. The only notable change is that there is a slight increase in variance if the other\ndimensions are more random. Interestingly, as the variance increases, modes that did not demonstrate\nprototype enhancement are starting to merge, verifying our previous conclusions. These results are\nconsistent for GAN/VAE and also CNN/FC networks (Appendix A).\n\nFigure 7: From left to right: Results on proportion of red color, size, and location. For all three\nfeatures, the learned distribution (green lines) is relatively independent of how many modes the other\n\u201cnuisance\u201d features can take (1-50). The only notable difference is that the learned distribution\nhas higher variance if the other \u201cnuisance\u201d features are more random (more modes). As variance\nincreases, some modes may display a tendency to merge together (prototype enhancement).\n\n5 Characterizing Generalization on Multiple Features\nIn this section we are interested in the joint distribution over multiple features. As we discussed in\nSection 3, the combinations a dataset D covers can be exponentially small compared to all possible\ncombinations in the underlying population distribution p. 
Therefore we explore when a learning\nalgorithm trained on a few combinations can generalize to novel ones. We \ufb01nd that if the training\ndistribution only contains a small number of combinations (e.g., 10-20 in a feature space Z) the\nlearned distribution memorizes them almost exactly. However, as there are more combinations in\nthe training set, the model starts to generate novel ones. We \ufb01nd this behavior to be very consistent\nacross different settings.\n\nExperimental Setup We use three different datasets:\n1) Pie Dataset: We use the pie dataset as shown in Figure 4. There are four features: size (5 possible\nvalues), x location (9 possible values), y location (9 values), proportion of red color (5 values). There\nare a total of approximately 2000 possible combinations, and we randomly select from 10 to 400\ncombinations as p(z) to generate our training set.\n2) Three MNIST: We use images that contain three MNIST digits. For each training example we \ufb01rst\nrandomly sample a three digit number between 000 and 999, then for each digit, we sample a random\nMNIST image belonging to that class (random style). There are 1000 combinations, and we randomly select\nfrom 10 to 160 of them to generate our training set. Examples are shown in Appendix A.\n3) Two object CLEVR: We use the CLEVR dataset where each object has two properties: its\ngeometric shape and its color. On this dataset, we select one shape that takes only a quarter of the\npossible colors, and one color that is assigned to only a quarter of the possible shapes.\n\nEvaluating Precision-Recall For the pie and MNIST datasets, we use precision-recall to compare the\nsupport (evaluation detail in Appendix B) of p(z) and q(z) (Figure 8 Left). Recall is de\ufb01ned as the\nproportion of combinations in the support of p(z) that are also in the support of q(z). A perfect recall\nmeans that all combinations in p(z) appear in the learned distribution q(z). 
Precision is de\ufb01ned as the\nproportion of combinations in the support of q(z) that are also in the support of p(z). A perfect\nprecision means that the learned distribution only generates combinations in the training set.\nThe precision-recall between the supports of p(z) and q(z) is shown in Figure 8. It can be observed\nthat both GAN and VAE achieve very high recall on both datasets (pie and MNIST). This means\nthat no modes are missing, and essentially all combinations in the training set are captured by\nthe learned distribution. However, as the number of combinations increases, the precision decreases,\nimplying that the model generates a signi\ufb01cant number of novel combinations. This means that if the\ndesired generalization behavior is to produce novel combinations, one does not need a large number\nof existing combinations, and approximately 100 is suf\ufb01cient. However, this can also be problematic\nif one wants to memorize a large number of combinations. For example, some objects may only take\ncertain colors (e.g., swans are not black), and in natural language some words can only be followed\nby certain other words. How to control the memorization/generalization preference for different tasks\nis an important research question.\nWe show in Figure 25 (Appendix) that the results are independent of the size of the network. We obtain\nalmost identical IoU curves from small networks with 3M parameters to large networks with 24M. In\naddition, the results are independent of the size of the dataset, and no difference was observed with\nonly half or twice as many training examples. In this task, low precision appears to be inherent for\nGAN/VAEs and cannot be remedied by increased network capacity or more data.\n\nVisualizing Generalization For the CLEVR dataset, we precisely characterize how q(z) differs\nfrom p(z). We use a training set where a shape only takes a quarter of the possible colors and observe\nits possible generalization to other colors. 
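The support-based precision and recall used above reduce each distribution to the set of feature combinations it puts mass on, and measure the overlap of those supports. A minimal sketch, with hypothetical (shape, color) combinations:

```python
# Each support is a set of feature combinations; recall asks whether every
# training combination is generated, precision asks whether every generated
# combination was trained on. The example data below is illustrative only.

def support_recall(p_support, q_support):
    # Fraction of training combinations the model also generates.
    return len(p_support & q_support) / len(p_support)

def support_precision(p_support, q_support):
    # Fraction of generated combinations that appeared in training.
    return len(p_support & q_support) / len(q_support)

trained = {('cube', 'red'), ('sphere', 'red'), ('sphere', 'yellow')}
generated = trained | {('cube', 'yellow')}      # one novel combination
print(support_recall(trained, generated))       # -> 1.0
print(support_precision(trained, generated))    # -> 0.75
```

Under this view, high recall with dropping precision corresponds exactly to the observed behavior: no training combination is lost, but novel combinations appear as the training variety grows.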
Figure 8: Precision and recall for GAN/VAE as a function of the number of training\ncombinations. Left: recall measures intersection/support(p(z)), while precision measures\nintersection/support(q(z)). Middle: precision and recall for the pie dataset. Right: precision and\nrecall for the three MNIST dataset. Both models achieve very high recall (the support of q(z) contains the\nsupport of p(z)), but precision decreases as the number of combinations increases. Lower precision\nmeans that the learned distribution generates combinations that did not appear in the training set.\n\nFigure 9: Generalization result on CLEVR where we have both shape and color as features. A:\nTraining distribution is uniform, except if the shape is torus-cone, the torus must be blue and the cone\nmust be red in 4x4. B: The marginal is preserved in the learned distribution (e.g. approx. the same\nproportion of torus-cone in both p(z) and q(z)), and a small number of novel colors are occasionally\ngenerated. C: For 9x9, if the shape is cone-cylinder, it must take either red-blue or blue-green. D:\nThe shape takes on all colors almost evenly, indicating clear generalization to all combinations.\n\nThe results are shown in Figure 9. First, we \ufb01nd that similar to the pie and MNIST experiments, the generalization behavior critically depends on the number\nof existing combinations. If there are few combinations (e.g. 16), then the learning algorithm will\ngenerate few, if any, of the combinations that did not appear. If there are more combinations\n(e.g. 81), then this shape does generalize to all colors. What is interesting to note is that the marginal\nis approximately preserved, so each shape is generated approximately as many times as it appeared in\nthe training set. 
When a rare shape (one that appeared with few colors) generalizes to other colors, it "shares" its probability mass with colors that did not appear. Similar experiments with rare colors are shown in Appendix A.

6 Conclusion and Future Work

In this paper we proposed an approach to studying generative modeling algorithms (of images) using carefully designed training sets. By observing the learned distribution, we gained new insights into the generalization behavior of these models. We found distinct generalization patterns for each individual feature, some of which have similarities with cognitive experiments on humans and primates. We also found general principles for single-feature generalization (convolution effect, prototype enhancement, and independence). For multiple features, we explored when the model generates novel combinations, and found a strong dependence on the number of existing combinations in the training set. In addition, we visualized the learned distribution and found that novel combinations are generated while preserving the marginal on each individual feature.
We hope that the framework and methodology we propose will stimulate further investigation into the empirical behavior of generative modeling algorithms, as several questions remain open. The first is what key ingredient leads to the behaviors we observed, given that the two types of models we explored (GAN/VAE) have very different architectures, training objectives, and hyper-parameter choices (Appendix A). Another important direction is to study the interaction between larger groups of features. We have characterized the model's generalization behavior on low-dimensional feature spaces, whereas generative modeling algorithms should be able to model thousands of features to capture distributions in the natural world.
How to organize and partition such a large number of features remains an open question.

Acknowledgements
This research was supported by Intel, TRI, NSF (#1651565, #1522054, #1733686), and ONR.

References
[1] M. Rosenblatt, "Remarks on some nonparametric estimates of a density function," The Annals of Mathematical Statistics, pp. 832–837, 1956.

[2] S. Efromovich, "Orthogonal series density estimation," Wiley Interdisciplinary Reviews: Computational Statistics, vol. 2, no. 4, pp. 467–476, 2010.

[3] S. Arora and Y. Zhang, "Do GANs actually learn the distribution? An empirical study," arXiv preprint arXiv:1706.08224, 2017.

[4] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," arXiv preprint arXiv:1312.6114, 2013.

[5] I. Goodfellow, J.
Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.

[6] D. J. Rezende, S. Mohamed, and D. Wierstra, "Stochastic backpropagation and approximate inference in deep generative models," arXiv preprint arXiv:1401.4082, 2014.

[7] S. Zhao, J. Song, and S. Ermon, "A Lagrangian perspective on latent variable generative models," in Proc. 34th Conference on Uncertainty in Artificial Intelligence, 2018.

[8] J. Ho and S. Ermon, "Generative adversarial imitation learning," in Advances in Neural Information Processing Systems, pp. 4565–4573, 2016.

[9] G. A. Alvarez, "Representing multiple objects as an ensemble enhances visual cognition," Trends in Cognitive Sciences, vol. 15, no. 3, pp. 122–131, 2011.

[10] J. P. Minda and J. D. Smith, "Prototype models of categorization: Basic formulation, predictions, and limitations," Formal Approaches in Categorization, pp. 40–64, 2011.

[11] S. S. Stevens, Psychophysics: Introduction to Its Perceptual, Neural and Social Prospects. Routledge, 2017.

[12] T. M. Mitchell, The Need for Biases in Learning Generalizations. Department of Computer Science, Laboratory for Computer Science Research, Rutgers Univ., New Jersey, 1980.

[13] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, "Understanding deep learning requires rethinking generalization," arXiv preprint arXiv:1611.03530, 2016.

[14] N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang, "On large-batch training for deep learning: Generalization gap and sharp minima," arXiv preprint arXiv:1609.04836, 2016.

[15] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P.
Abbeel, "InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets," in Advances in Neural Information Processing Systems, pp. 2172–2180, 2016.

[16] S. Zhao, J. Song, and S. Ermon, "Learning hierarchical features from deep generative models," in International Conference on Machine Learning, pp. 4091–4099, 2017.

[17] Y. Li, J. Song, and S. Ermon, "InfoGAIL: Interpretable imitation learning from visual demonstrations," in Advances in Neural Information Processing Systems, pp. 3812–3822, 2017.

[18] L. Theis, A. v. d. Oord, and M. Bethge, "A note on the evaluation of generative models," arXiv preprint arXiv:1511.01844, 2015.

[19] A. Borji, "Pros and cons of GAN evaluation measures," arXiv preprint arXiv:1802.03446, 2018.

[20] A. Grover, M. Dhar, and S. Ermon, "Flow-GAN: Combining maximum likelihood and adversarial learning in generative models," in AAAI Conference on Artificial Intelligence, 2018.

[21] A. Gretton, K. M. Borgwardt, M. Rasch, B. Schölkopf, and A. J. Smola, "A kernel method for the two-sample-problem," in Advances in Neural Information Processing Systems, pp. 513–520, 2007.

[22] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, G. Klambauer, and S. Hochreiter, "GANs trained by a two time-scale update rule converge to a Nash equilibrium," arXiv preprint arXiv:1706.08500, 2017.

[23] T. Che, Y. Li, A. P. Jacob, Y. Bengio, and W. Li, "Mode regularized generative adversarial networks," arXiv preprint arXiv:1612.02136, 2016.

[24] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, "Improved techniques for training GANs," in Advances in Neural Information Processing Systems, pp. 2234–2242, 2016.

[25] M.
Rosenblatt, "A central limit theorem and a strong mixing condition," Proceedings of the National Academy of Sciences, vol. 42, no. 1, pp. 43–47, 1956.

[26] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville, "Improved training of Wasserstein GANs," arXiv preprint arXiv:1704.00028, 2017.

[27] A. Nieder, D. J. Freedman, and E. K. Miller, "Representation of the quantity of visual items in the primate prefrontal cortex," Science, vol. 297, no. 5587, pp. 1708–1711, 2002.

[28] M. Piazza, V. Izard, P. Pinel, D. Le Bihan, and S. Dehaene, "Tuning curves for approximate numerosity in the human intraparietal sulcus," Neuron, vol. 44, no. 3, pp. 547–555, 2004.

[29] J. Johnson, B. Hariharan, L. van der Maaten, L. Fei-Fei, C. L. Zitnick, and R. Girshick, "CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning," in CVPR, 2017.

[30] A. Nieder and E. K. Miller, "Coding of cognitive magnitude: Compressed scaling of numerical information in the primate prefrontal cortex," Neuron, vol. 37, no. 1, pp. 149–157, 2003.

[31] A. Nieder and K. Merten, "A labeled-line code for small and large numerosities in the monkey prefrontal cortex," Journal of Neuroscience, vol. 27, no. 22, pp. 5986–5993, 2007.

[32] D. J. Smith and J. P. Minda, "Thirty categorization results in search of a model," Journal of Experimental Psychology: Learning, Memory, and Cognition, vol. 26, no. 1, p. 3, 2000.

[33] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A.
Lerchner, "beta-VAE: Learning basic visual concepts with a constrained variational framework," 2016.