Paper ID: 3589

Title: Multi-objects Generation with Amortized Structural Regularization

Let me begin by stating that my judgement is based on the assumption that the derivations are mathematically correct: though I attempted to verify them, I am not certain everything is free of error. With that in mind:

Originality: I believe the central idea of the work is novel, and the derivation is new, at least to my limited knowledge of the related work. However, the generative model involved (AIR) is borrowed directly from prior work, and there has definitely been prior work attempting to regularize the posterior. Although the idea of using structural knowledge to directly shape the distribution of interpretable posterior variables such as size and number of objects is novel, I am not sure whether this idea is generalizable enough (see the quality and significance sections) to constitute a major original contribution.

Quality: I believe the derivation is technically sound, though I have not verified the mathematical details and cannot be sure it is free of errors. The results are convincing and clearly show the proposed method working. What I am less sure about is whether the proposed method is generalizable enough to be applied to a variety of tasks:
- First, the examples in the paper all require the posterior to be simple and interpretable (number of objects, position, size): can this method generalize well to other attributes, where it might be non-trivial to infer properties of the posterior? Will the method work if the F in F(q) is more complex, e.g. involves a neural network? It is not immediately clear to me that it will, especially in practice, so some discussion here would be nice.
- From my understanding, it is necessary to hard-code a specific set of constraints to make the regularization work, and doing so is non-trivial. In the bounding-box example, the heuristics penalize things that are not necessarily true, e.g. that bounding boxes need to be of similar sizes or stay within the image.
My worry is that such regularization encourages the model to select an easy set of solutions, which might achieve better performance than an unregularized model because it avoids the most obvious mistakes, but is incapable of generating truly faithful results (where, e.g., bounding boxes can be of varying sizes) that a sufficiently powerful unregularized model could theoretically achieve. Could the authors discuss when such regularizations are clearly desirable, and provide a more principled way of defining the constraints?

With regard to evaluation:
- Why are SQAIR and other recent models not compared against? While I understand that the purpose is to isolate the effect of regularization, I am not sure the authors are picking the best toy cases if these are tasks that current methods can already handle well, as regularization appears to come at a cost.

Clarity: The exposition is mostly clear.
- Figures 3 and 4 are confusing, as there is no label distinguishing data from reconstruction. I am also not sure what the 7th columns mean, why the bounding boxes for the data are the same as for the reconstruction, and why the examples picked are different for the three methods. It would also be easier to show results side by side across methods; there is no need to show the data multiple times.

Significance: I have mixed feelings about this: although the proposed method is clearly novel in some ways and could be applied to a variety of tasks, it appears to me that finding suitable regularization terms for a task is a highly non-trivial effort. I would be more convinced of the significance of the work if the authors could provide more discussion on how the framework can be applied in a more general setting.
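To make concrete what I mean by F(q) and hard-coded constraints, the setup I have in mind is the standard constrained posterior-regularization formulation (written in my own notation, which may not match the paper's exactly):

```latex
% Constrained posterior regularization (illustrative notation):
% the variational family is restricted to posteriors whose expected
% feature values \phi(Z) satisfy hand-specified bounds b.
\begin{aligned}
q^{*} &= \operatorname*{arg\,min}_{q \in \mathcal{Q}}
          \; \mathrm{KL}\big(q(Z) \,\|\, p(Z \mid X)\big), \\
\mathcal{Q} &= \{\, q : F(q) \le b \,\}, \qquad
F(q) = \mathbb{E}_{q}\big[\phi(Z)\big].
\end{aligned}
```

My question is whether the framework still behaves well when the feature function phi is itself learned (e.g. a neural network) rather than a simple interpretable statistic such as object count or bounding-box size.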
Summary: Although I think a lot more could be added to the paper, and I am not entirely sure whether the proposed method can genuinely work in a more complex setting, I feel there is certainly value in this paper: it proposes a mathematically rigorous (if correct) framework with clear novelty and use cases. I am slightly inclined to accept this paper and could be more certain if the authors addressed some of my biggest concerns.

-------------------

Post-rebuttal comment: I thank the authors for addressing all my major concerns. Though I would love to see more concrete examples, I still think that the conceptual insights are novel enough to warrant acceptance. (Though I am keeping the original score of 6, I would add that this is very close to a 7 post-rebuttal.)

# originality: The paper is novel in improving AIR with a parametric prior distribution for the generative model and in re-deriving a VAE-type objective for structured generative models. Human knowledge is formulated as a regularization term in this objective.

# quality & significance: The paper is of high presentation quality, with a clear formulation and extensive experiments. The problem is significant, since more and more researchers aim to embed human knowledge or physical constraints into unsupervised learning. This paper presents a good working example.

# clarity: The paper is well written.

This paper proposes an ASR framework, which adds a regularization term over the posterior distribution. By this means, the authors claim that the structure of images can be well captured, facilitating multi-object generation. The writing of this paper is good and the organization is easy to follow. However, there are some concerns that need to be addressed.

1) What does "structure" mean in this paper? "Structure" is mentioned many times, but I still do not understand its specific meaning. Does it mean shape information? The authors need to explain it more clearly.

2) The novelty of this paper is somewhat marginal. The proposed ASR framework is mainly based on the AIR model, and the added innovation, Posterior Regularization (PR), is also an existing method. It seems that the authors only combine the two methods (AIR and PR).

3) In line 141, the authors replace the prior distribution q(Z) with the posterior distribution q(Z | X). The authors claim that this operation is based on [17]. However, [17] adopts q(Z | X) to approximate the posterior distribution p(Z | X), which is different from the operation in line 141. The authors should give a detailed derivation.

4) The meaning of the penalty in Eq. (10) is unclear. Why do the authors add this penalty, and what effect does this term have?

5) The experiments are insufficient:
- Lack of an ablation study. For example, the effects of the operation in line 141 and of the penalty in Eq. (10) should be verified.
- Lack of a parameter sensitivity analysis. For example, is the proposed algorithm sensitive to the parameter lambda in Eq. (10)?
- It is highly suggested to conduct experiments on more datasets, as in [3] and [11].
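For concreteness regarding points 4) and 5): I understand the penalty in Eq. (10) to be an instance of the generic Lagrangian-relaxation form of a posterior-regularized ELBO, which I sketch below in my own placeholder notation (lambda and Omega here are illustrative, not necessarily the paper's exact symbols):

```latex
% Penalized (Lagrangian) form of a posterior-regularized ELBO:
% the constraint-violation term \Omega(q) is subtracted with weight \lambda.
\max_{\theta, \phi} \;
\mathbb{E}_{q_{\phi}(Z \mid X)}\!\big[\log p_{\theta}(X \mid Z)\big]
\;-\; \mathrm{KL}\big(q_{\phi}(Z \mid X) \,\|\, p_{\theta}(Z)\big)
\;-\; \lambda \, \Omega\big(q_{\phi}(\cdot \mid X)\big)
```

If this reading is correct, an ablation over lambda (including lambda = 0, which recovers the plain ELBO) would directly address both the penalty's effect and its sensitivity.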