{"title": "Multi-object Generation with Amortized Structural Regularization", "book": "Advances in Neural Information Processing Systems", "page_first": 6619, "page_last": 6629, "abstract": "Deep generative models (DGMs) have shown promise in image generation. However, most of the existing methods learn a model by simply optimizing a divergence between the marginal distributions of the model and the data, and often fail to capture rich structures, such as attributes of objects and their relationships, in an image. Human knowledge is a crucial element to the success of DGMs in inferring these structures, especially in unsupervised learning. In this paper, we propose amortized structural regularization (ASR), which adopts posterior regularization (PR) to embed human knowledge into DGMs via a set of structural constraints. We derive a lower bound of the regularized log-likelihood in PR and adopt the amortized inference technique to jointly optimize the generative model and an auxiliary recognition model for efficient inference. Empirical results show that ASR outperforms the DGM baselines in terms of inference performance and sample quality.", "full_text": "Multi-object Generation with Amortized Structural Regularization

Kun Xu, Chongxuan Li, Jun Zhu∗, Bo Zhang

Dept. of Comp. Sci. & Tech., Institute for AI, THBI Lab, BNRist Center,
State Key Lab for Intell. Tech. & Sys., Tsinghua University, Beijing, China
{kunxu.thu, chongxuanli1991}@gmail.com, {dcszj, dcszb}@tsinghua.edu.cn

Abstract

Deep generative models (DGMs) have shown promise in image generation. However, most of the existing methods learn a model by simply optimizing a divergence between the marginal distributions of the model and the data, and often fail to capture rich structures, such as attributes of objects and their relationships, in an image. Human knowledge is a crucial element to the success of DGMs in inferring these structures, especially in unsupervised learning. 
In this paper, we propose amortized structural regularization (ASR), which adopts posterior regularization (PR) to embed human knowledge into DGMs via a set of structural constraints. We derive a lower bound of the regularized log-likelihood in PR and adopt the amortized inference technique to jointly optimize the generative model and an auxiliary recognition model for efficient inference. Empirical results show that ASR outperforms the DGM baselines in terms of inference performance and sample quality.

1 Introduction

Deep generative models (DGMs) [19, 26, 10] have made significant progress in image generation, which largely promotes downstream applications, especially in unsupervised learning [5, 7] and semi-supervised learning [20, 6]. In most real-world settings, visual data is often presented as a scene of multiple objects with complicated relationships among them. However, most of the existing methods [19, 10] lack a mechanism to capture the underlying structures in images, including regularities (e.g., size, shape) of an object and the relationships among objects. This is because they adopt a single feature vector to represent the whole image and consequently focus on generating images with a single main object [17]. This largely impedes DGMs from generalizing to complex scene images. How to solve the problem in an unsupervised manner is still largely open.

The key to addressing the problem is to model the structures explicitly. Existing work attempts to solve the problem via structured DGMs [8, 24], where a structured prior distribution over latent variables is used to encode the structural information of images and regularize the model behavior under the framework of maximum likelihood estimation (MLE). However, there are two potential limitations of such methods. 
First, merely maximizing data's log-likelihood of such models often fails to capture the structures in an unsupervised manner [21]. Maximizing the marginal likelihood does not necessarily encourage the model to capture the reasonable structures because the latent structures are integrated out. Besides, the optimization process often gets stuck in local optima because of the highly non-linear functions defined by neural networks, which may also result in undesirable behavior. Second, it is generally challenging to design a proper prior distribution that is both flexible and computationally tractable. Consider the case where we want to uniformly sample several 20 × 20 bounding boxes in a 50 × 50 image without overlap. It is difficult to define a tractable prior distribution, as shown in Fig. 1. Though it is feasible to set the probability to zero when the prior knowledge is violated using indicator functions, other challenges like non-convexity and non-differentiability will be introduced to the optimization problem.

Figure 1: An illustration of the overlapping problem. The first bounding box is in red, and the second one is in green. The overlapping area is in purple. Defining the prior distribution in the auto-regressive manner is still challenging since some locations are not valid even for the first bounding box, as shown in the right panel.

∗Corresponding author.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

In contrast, the posterior regularization (PR) [9] and its generalized version in Bayesian inference, i.e., regularized Bayesian inference (RegBayes) [30], provide a general framework to embed human knowledge in generative models, which directly regularizes the posterior distribution instead of designing proper prior distributions. 
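The bounding-box example above can be made concrete with a small simulation. The sketch below (ours, not from the paper; `sample_boxes` and its parameters are hypothetical) draws box corners uniformly and rejects configurations that overlap. It illustrates why the indicator-function "prior" is awkward: the set of valid locations for each box depends on all the others, so the constrained distribution has no simple closed form and can only be reached by rejection.

```python
import random

def sample_boxes(n_boxes=2, box=20, canvas=50, max_tries=10000):
    """Rejection-sample top-left corners of non-overlapping, axis-aligned
    box x box bounding boxes on a canvas x canvas image.

    Returns a list of (x, y) corners, or None if the try budget runs out.
    """
    for _ in range(max_tries):
        # Propose corners from the unconstrained uniform "prior".
        corners = [(random.randint(0, canvas - box),
                    random.randint(0, canvas - box)) for _ in range(n_boxes)]
        # Two equal-size boxes are disjoint iff they are separated
        # by at least `box` pixels along some axis.
        ok = all(abs(x1 - x2) >= box or abs(y1 - y2) >= box
                 for i, (x1, y1) in enumerate(corners)
                 for (x2, y2) in corners[i + 1:])
        if ok:
            return corners
    return None

boxes = sample_boxes()
```

For two 20 × 20 boxes in a 50 × 50 image the acceptance rate is tolerable, but it collapses as more boxes are packed in, which is exactly the intractability the text points at.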
In PR and RegBayes, a valid posterior set is defined according to the human knowledge, and the KL-divergence between the true posterior and the valid set (see the formal definition in Sec. 2.2) is minimized to regularize the behavior of structured DGMs. However, the valid set consists of sample-specific variational distributions. Therefore, the number of parameters in the variational distribution grows linearly with the number of training samples, and it requires an inner loop for accurately approximating the regularized posterior [28]. This computational issue makes it non-trivial to apply PR to large-scale datasets and DGMs directly.

In this paper, we propose a flexible amortized structural regularization (ASR) framework to improve the performance of structured generative models based on PR. ASR is a general framework to properly incorporate structural knowledge into DGMs by extending PR to the amortized setting, and its objective function is the log-likelihood of the training data along with a regularization term over the posterior distribution. The regularization term can help the model to capture reasonable structures of an image and to avoid unsatisfactory behavior that violates the constraints. We derive a lower bound of the regularized log-likelihood and use an amortized recognition model to approximate the constrained posterior distribution. By slacking the constraints as a penalty term, ASR can be optimized efficiently using gradient-based methods. We apply ASR to the state-of-the-art structured generative models [8] for the multi-object image generation tasks. 
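For reference, the standard PR objective of [9] can be sketched as follows; the constraint features φ and bounds b are placeholder notation for this sketch, not necessarily the notation of Sec. 2.2:

```latex
\max_{\theta}\; \log p_{\theta}(\mathbf{x})
  \;-\; \min_{q \in \mathcal{Q}} \,
    \mathrm{KL}\!\left( q(\mathbf{z}) \,\middle\|\, p_{\theta}(\mathbf{z} \mid \mathbf{x}) \right),
\qquad
\mathcal{Q} \;=\; \left\{ \, q \;:\;
    \mathbb{E}_{q(\mathbf{z})}\!\left[ \phi(\mathbf{x}, \mathbf{z}) \right] \le \mathbf{b} \, \right\}.
```

Since the valid set Q is defined per training sample x, the variational parameters grow with the dataset size; this is the computational issue that amortizing q with a shared recognition network is meant to remove.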
Empirical results demonstrate the effectiveness of our proposed method: both the inference and generative performance are improved with the help of human knowledge.

2 Preliminary

2.1 Iterative generative models for multiple objects

Attend-Infer-Repeat (AIR) [8] is a structured latent variable model, which decomposes an image into several objects. The attributes of objects (i.e., appearance, location, and scale) are represented by a set of random variables z = {z^{app}, z^{loc}, z^{scale}}. The generative process starts from sampling the number of objects n ∼ p(n), and then n sets of latent variables are sampled independently as z_i ∼ p(z). The final image is composed by adding these objects onto an empty canvas. Specifically, the joint distribution and its marginal over the observed data can be formulated as follows:

p(x, z, n) = p(n) \prod_{i=1:n} p(z_i) \, p(x|z, n), \qquad p(x) = \sum_{n} \int_{z} p(x, z, n) \, dz.

The conditional distribution p(x|z, n) is usually formulated as a multivariate Gaussian distribution with mean \mu = \sum_{i=1:n} f_{dec}(z_i), or a Bernoulli distribution with probability p = \sum_{i=1:n} f_{dec}(z_i), for pixels in images. f_{dec} is a decoder network that transfers the latent variables to the image space.

In an unsupervised manner, AIR can infer the number of objects, as well as the latent variables for each object, efficiently using amortized variational inference. The latent variables are inferred iteratively, and the number of objects n is represented by z^{pres}: an (n+1)-dimensional binary vector with n ones followed by a zero. The i-th element of z^{pres} denotes whether the inference process is terminated or not. Then the inference model can be formulated as follows:

q(z, n|x) = q(z^{n+1}_{pres} = 0|x, z