Summary and Contributions: This paper improves StyleGAN-based image generation by disentangling semantics through a learnable semantic grouping operation: the styles of intra-group features are controlled by group-wise adaptive instance normalization, and the features are re-balanced across groups by inter-group adaptive group normalization. Quantitative and qualitative evaluations show improvements over existing methods.
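To state how I read the two normalization steps, here is a minimal sketch, assuming per-channel statistics for the intra-group styling and per-group statistics for the re-balancing (my own PyTorch paraphrase with assumed shapes and helper names, not the authors' implementation):

```python
import torch

def intra_group_adain(x, groups, styles, eps=1e-5):
    """Group-wise AdaIN: normalize each semantic channel group and re-style it.

    x:      feature map of shape (N, C, H, W)
    groups: list of channel-index tensors, one per semantic group
    styles: list of (gamma, beta) pairs, each of shape (N, len(group), 1, 1)
    """
    out = x.clone()
    for idx, (gamma, beta) in zip(groups, styles):
        g = x[:, idx]                                  # (N, Cg, H, W)
        mu = g.mean(dim=(2, 3), keepdim=True)          # per-channel statistics
        sigma = g.std(dim=(2, 3), keepdim=True) + eps
        out[:, idx] = gamma * (g - mu) / sigma + beta
    return out

def inter_group_adagn(x, groups, gains, eps=1e-5):
    """Inter-group AdaGN: re-balance the relative magnitudes of the groups."""
    out = x.clone()
    for idx, gain in zip(groups, gains):
        g = x[:, idx]
        mu = g.mean(dim=(1, 2, 3), keepdim=True)       # per-group statistics
        sigma = g.std(dim=(1, 2, 3), keepdim=True) + eps
        out[:, idx] = gain * (g - mu) / sigma
    return out

# toy usage: 8 channels split into two semantic groups
x = torch.randn(2, 8, 16, 16)
groups = [torch.arange(0, 4), torch.arange(4, 8)]
styles = [(torch.ones(2, 4, 1, 1), torch.zeros(2, 4, 1, 1))] * 2
gains = [torch.tensor(1.0), torch.tensor(0.5)]
y = inter_group_adagn(intra_group_adain(x, groups, styles), groups, gains)
```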
Strengths: - The quantitative evaluations and ablation study validate the effectiveness of the proposed improvements.
Weaknesses: 1. The most critical limitation of this work concerns its novelty and theoretical soundness.
- Disentanglement by grouping similar features is not new and is extensively discussed in existing works, such as DBT [10]. However, similarity between the layers of a convolutional kernel may not indicate consistent similarity between the corresponding feature channels; if it does not, the authors should provide more evidence or discussion.
- Mutual information loss, i.e., the InfoGAN framework, is well explored in controllable GAN methods. Merely employing it to regularize style codes may not qualify as a valid contribution.
[After rebuttal] The feedback discussed the differences from DBT: 1) different grouping strategies (uniformly divided channels vs. adaptively clustered channels) and 2) different target tasks (classification vs. unsupervised image generation). I am glad that an additional baseline (DBT as the grouping algorithm) was reported, showing a measurable performance drop relative to the proposed method. Thus, although channel grouping itself may not be new, the proposed grouping strategy together with the intra- and inter-group association is useful for this specific task.
2. Another weakness lies in the qualitative comparison. The functionality of the proposed modules is hard to observe from these comparisons. For example:
- Besides the numerous examples illustrating the semantic disentanglement of the proposed method, such as Fig. 4 and Fig. 5, necessary visual comparisons with previous methods are missing. In Fig. 1, the differences between StyleGAN v2 and the proposed method are also too subtle.
- Even though the quantitative metrics show better scores, the inpainting results in Fig. 6 do not show a compelling improvement over previous methods.
[After rebuttal] I am glad that the author feedback provided more visual results, which are illustrative and helpful.
Correctness: The claims and method may be correct but are not convincing at present; they require more in-depth discussion and evaluation.
Clarity: This paper is generally well organized. However, it contains numerous typos and grammatical errors, e.g., a stray ``hline'' in Table 3, and ``40w iteration'' should be ``400k iterations''.
Relation to Prior Work: As indicated above, more discussion of the differences from previous works is expected in the related work section.
Reproducibility: Yes
Additional Feedback:
Summary and Contributions: This paper proposes a similarity-based grouping method for semantic disentanglement in GANs. The core idea lies in clustering channels with the same semantics according to similarity and applying adaptive group-wise normalization. Experiments are conducted on four datasets and achieve superior performance compared to other methods. In addition, ablation studies are conducted to verify the effectiveness of the proposed method.
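As a point of reference for my reading, a hedged sketch of the kind of similarity-based channel assignment described above (hard assignment to learned group prototypes is my stand-in; the paper learns the grouping end-to-end, and the names below are mine, not the authors'):

```python
import torch
import torch.nn.functional as F

def group_channels_by_similarity(channel_emb, prototypes):
    """Assign each feature channel to the group whose prototype it is
    most similar to under cosine similarity.

    channel_emb: (C, D) learned embedding per feature channel
    prototypes:  (K, D) one learned prototype per semantic group
    returns:     list of K index tensors, one per group
    """
    a = F.normalize(channel_emb, dim=-1)          # (C, D), unit norm
    b = F.normalize(prototypes, dim=-1)           # (K, D), unit norm
    sim = a @ b.t()                               # (C, K) cosine similarities
    assignment = sim.argmax(dim=1)                # hard group assignment
    return [torch.nonzero(assignment == k, as_tuple=True)[0]
            for k in range(prototypes.size(0))]

# toy example: 8 channels, 16-dim embeddings, 2 semantic groups
groups = group_channels_by_similarity(torch.randn(8, 16), torch.randn(2, 16))
```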
Strengths: The method addresses fine-grained semantic disentanglement based on semantic-aware relative importance. The proposed intra-group and inter-group embeddings make use of the intermediate latent space to achieve semantic-aware control. Evaluations on the LSUN CATS, LSUN CARS, FFHQ, and Paris Street View datasets demonstrate the effectiveness of the proposed method.
Weaknesses: 1. There should be a formal definition of the tasks in the paper, i.e., unconditional image generation and inpainting. 2. The experimental part does not cover some hyper-parameters of the proposed similarity-based grouping method, such as \lambda_1 and \lambda_2 in Equation (7).
Correctness: The claims and method are correct.
Clarity: The paper is clearly presented. It would be better to further smooth the relationship between Sections 3.1 and 3.2 and the objective functions in Section 3.3.
Relation to Prior Work: The authors introduce style-based generator and interpretable representation in the related work part. I think the introduction to the prior work is sufficient.
Reproducibility: Yes
Additional Feedback: It would be helpful to compare the contributions of the intra-group and inter-group embeddings to the final result, so as to show the effectiveness of each module in SariGAN.
Summary and Contributions: This paper focuses on learning to disentangle latent factors in image generation tasks. Considering that the existing StyleGAN can only embed the latent code at different image resolutions and control scale-aware image styles, the authors propose to learn a semantic-aware manipulation (on feature channels) via a learned AdaGN operation. Specifically, this disentanglement is achieved by a feature-channel clustering module and by embedding latent codes at both the intra-group and inter-group levels. The synthesized images show state-of-the-art performance on a broad range of image synthesis tasks, including unconditional image generation (LSUN CATS, LSUN CARS, and FFHQ faces) and conditional image inpainting.
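To make my understanding of the "learned AdaGN" concrete: a sketch of how a single latent code might be mapped to per-group modulation parameters through separate affine heads (the module name GroupStyleMapper and the exact head layout are my assumptions, not taken from the paper):

```python
import torch
import torch.nn as nn

class GroupStyleMapper(nn.Module):
    """Map a latent code w to per-group modulation parameters:
    (gamma, beta) for intra-group styling and a scalar gain per group
    for inter-group re-balancing."""

    def __init__(self, w_dim, group_sizes):
        super().__init__()
        self.group_sizes = group_sizes
        self.intra = nn.ModuleList(nn.Linear(w_dim, 2 * c) for c in group_sizes)
        self.inter = nn.ModuleList(nn.Linear(w_dim, 1) for _ in group_sizes)

    def forward(self, w):
        styles, gains = [], []
        for c, head_s, head_g in zip(self.group_sizes, self.intra, self.inter):
            gamma, beta = head_s(w).chunk(2, dim=-1)        # (N, c) each
            styles.append((gamma.view(-1, c, 1, 1), beta.view(-1, c, 1, 1)))
            gains.append(head_g(w).view(-1, 1, 1, 1))       # (N, 1, 1, 1)
        return styles, gains

# toy usage: a 512-d latent code and two groups of 4 channels each
styles, gains = GroupStyleMapper(512, [4, 4])(torch.randn(2, 512))
```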
Strengths: 1. This paper addresses a key problem: the latent space in image generation can be further disentangled (not only scale-aware but also semantic-aware) for high-quality generation results and good interpretability. 2. The framework of channel-similarity grouping and semantic-aware mapping (into both intra- and inter-groups) is novel. I think this is a reasonable and solid design for learning fine-grained semantics. 3. The paper is well written, and the experimental comparisons, ablation study, and case studies are comprehensive in validating the effectiveness of the proposed method.
Weaknesses: 1. The paper should add some details on the discriminator architectures (e.g., how many layers are used for the different datasets). 2. Will the code be released in the future for better reproducibility?
Correctness: Yes. The claims and the method are correct. The empirical methodology is correct.
Clarity: Yes, this paper is very clear and well written.
Relation to Prior Work: Yes. The differences between this work and some previous contributions are clearly discussed in this paper.
Reproducibility: Yes
Additional Feedback: [After rebuttal] After reading the rebuttal and the other reviewers' comments, I keep my original rating.