NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Reviewer 1
Post-rebuttal: Thank you to the authors for submitting a thoughtful rebuttal. The other reviews didn't raise any new concerns that would sway my original rating.

Originality: The main originality of this work is the observation that small perturbations of objects still result in valid scenes. This allows the method to solve a much broader problem than Remez 2018, without the need for object detectors and heuristics for where to position the pastes. While Remez is limited to generating object masks for object detections (e.g. weakly supervised instance segmentation), this work is able to also consider the fully unsupervised setting, where no detections are available to paste from.

Quality, Clarity: Overall, I think the writing is clear and the experiments are thorough.

Evaluation: The authors do a good job of validating the contributions with ablation studies and show that indeed this foreground shifting is essential to the performance of the technique. I think evaluating with Mask R-CNN is a clever idea for a dataset without annotations (a sketch of this pseudo-ground-truth evaluation follows this review), but it raises a greater concern: Why not just train this method on a dataset with a labeled test set? Does this method require very object-centric images? Would this method work for MS COCO? Does it matter that it doesn't work for MS COCO? Further, how would this work compare to Remez in their experimental setting, i.e. how would your method perform at learning to generate object masks from detections? It seems natural to do this experiment given the similarity to Remez.

One important comment: This method is not fully unsupervised (line 62) -- a huge amount of effort goes into collecting a dataset that only contains the category of interest. I think it's important to clarify this.
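For context on the evaluation protocol questioned above, here is a minimal sketch of how predicted soft masks could be scored against Mask R-CNN outputs treated as pseudo ground truth. The helper names and the 0.5 binarization threshold are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection-over-union between two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(pred, gt).sum()) / float(union)

def mean_iou(pred_masks, pseudo_gt_masks, threshold=0.5):
    """Score soft predicted masks (values in [0, 1]) against
    Mask R-CNN masks used as pseudo ground truth."""
    scores = [iou(p >= threshold, g)
              for p, g in zip(pred_masks, pseudo_gt_masks)]
    return float(np.mean(scores))
```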
Reviewer 2
This paper presents a generative model of layered object representations in which a generator synthesizes a foreground object, a foreground mask, and a background to compose an image, while a discriminator judges the composite's realism. To prevent the generator from cheating, i.e. generating the same foreground and background with a random mask, this paper proposes to add a random shift to the generated foreground object and its mask. The rationale is that the generated foreground object and its mask must be valid to allow such random shifts while maintaining the realism of the scene; in other words, in a valid layered scene the foreground can move independently of the background. As a result, this paper shows that a high-quality foreground mask emerges from this layered generative model, so that an encoder can be trained to predict the mask from an image for object segmentation. (A sketch of this compositing-and-shift mechanism follows this review.)

--Originality
A series of layered generative models have been proposed, such as Attend-Infer-Repeat (AIR) (Eslami et al., NIPS 2016) and Generative Modeling of Infinite Occluded Objects (Yuan et al., ICML 2019), but they can only handle synthetic scenes. The "Attribute2Image" model (Yan et al., ECCV 2016) proposed a similar layered generative model, but its mask is supervised. To my knowledge, this paper is the first layered generative model that can synthesize real images without mask supervision. A thorough survey of this topic would improve the originality of this paper. This paper discussed some of the weakly-supervised/unsupervised object segmentation models, but it is not clear how well it compares with them.

--Quality
This paper clearly states the basic assumption of 15%-25% support of an object and formulates a corresponding loss (Equation 3). The proposed generative model is trained with state-of-the-art GAN algorithms and yields high-quality samples on a few object categories. The generated masks in Figure 2 have impressive details. A detailed evaluation of image generation quality, especially in comparison with generative models without layered representations, would be very helpful for improving this paper's quality. This paper did a fairly good ablation study on different parameters, but it is not clear to me why it uses the segmentation masks computed by Mask R-CNN as ground truth. Why not use the actual ground truth and compare with supervised results? I believe \eta in Equation 3 is an important parameter that decides how many foreground pixels the model is expected to generate; it would be useful to perform an analysis varying its value. I'm also curious whether this paper has considered other priors, e.g. mask connectivity, for unsupervised segmentation. This paper is closely related to the work by Remez et al. in ECCV 2018, as stated, so an experimental comparison with it and other baselines, e.g. GrabCut, would improve the paper's quality.

--Clarity
Overall, the clarity is good. As this paper uses two generators, for the foreground and the background separately, it would be good to make this clear in Figure 1.

--Significance
This paper presents a nice step forward towards unsupervised learning of layered generative models, which have been shown to have potential for high-quality image synthesis and scene understanding. Unsupervised segmentation is also an interesting problem in computer vision towards reducing human annotation labor. Overall, I like this paper but also believe it has substantial room to improve in terms of literature review, experiment setup, and evaluation.
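To make the mechanism described above concrete, here is a minimal sketch of the layered composite, the joint random shift of the foreground and its mask, and a hinge-style penalty in the spirit of the 15%-25% support assumption (Equation 3). All names, the circular-shift choice, and the default eta = 0.2 are assumptions made for illustration; the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def joint_random_shift(fg, mask, max_shift=8):
    """Translate the foreground and its mask by the SAME random offset.
    A circular shift is used here purely for simplicity; fg, mask are
    (N, C, H, W) tensors, with mask values in [0, 1]."""
    dy = int(torch.randint(-max_shift, max_shift + 1, (1,)))
    dx = int(torch.randint(-max_shift, max_shift + 1, (1,)))
    shift = lambda t: torch.roll(t, shifts=(dy, dx), dims=(2, 3))
    return shift(fg), shift(mask)

def composite(fg, mask, bg):
    """Alpha-composite the shifted foreground over the background.
    A valid layered scene must stay realistic under this perturbation,
    which is what prevents the generator from 'cheating'."""
    fg_s, mask_s = joint_random_shift(fg, mask)
    return mask_s * fg_s + (1.0 - mask_s) * bg

def mask_support_loss(mask, eta=0.2):
    """Penalize masks whose mean support falls below eta,
    encouraging the assumed 15%-25% object coverage."""
    area = mask.mean(dim=(1, 2, 3))  # per-sample foreground fraction
    return F.relu(eta - area).mean()
```

In such a setup the discriminator would receive composite(fg, mask, bg) as a fake sample; because the offset is resampled at every step, a degenerate mask cannot hide an object painted directly into the background.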
======After rebuttal======= I'm mostly satisfied with the rebuttal and request that the authors add the literature and baselines as promised.
Reviewer 3
This paper introduces a simple but highly original idea, leading to a really impressive result. The basic idea is similar to Remez '18, "Learning to Segment via Cut-and-Paste", which also uses a GAN to generate a mask in an unsupervised fashion and a discriminator to distinguish composite images from natural ones. However, this paper's use of a three-part generative model, producing a background, a foreground, and a mask simultaneously, completely avoids the problems that arise in Remez around identifying where the object can be plausibly pasted into a pre-existing background (it must not overlap the original object, the placement must be realistic -- no cars in the sky -- the resolutions of background and foreground images must match, etc.), as well as the failure mode where the mask selects only part of the object. Thus I believe the paper has considerable novelty.

The paper represents a significant step towards learning to segment objects from images without any ground-truth mask data. One drawback is that it requires photo-realistic image generation, which currently can only be accomplished for a limited set of classes -- and usually only one class at a time -- by state-of-the-art GAN models. Currently this is somewhat limiting as a segmentation model, but the problem setup feels like a novel image (background + object + mask) generation problem. As such, I could see this having significant impact in both the image generation and segmentation communities.

The paper is well written, clear, and easy to follow, with some very minor exceptions that I've called out in the "Improvements" section below. The experiments are sufficiently comprehensive, although it would be interesting to see in more detail how the approach performs on images with partially-occluded foreground objects, and how "central" the foreground object must be. When presented with two foreground objects, does mask generation tend towards a semantic mask (the chairs in Fig 2c), or does it produce an instance mask for only one of the instances? Overall a very nice paper. I enjoyed reading it.

==== UPDATE: I have reviewed the insightful comments from the other reviewers, as well as the authors' response. I believe it addresses all of the major concerns. An expanded literature review using Reviewer #2's suggestions would be helpful, but the omissions are not fatal to this paper. Based on this I have decided to leave my score as-is.