This paper studies the interesting but difficult problem of generating physically plausible scenes and videos that contain multiple interacting objects. The generative model is a GAN that extends a 2D variant of BlockGAN with a relational layer. It includes a correction network to model object interactions and a structural loss function to evaluate the object poses. The reviewers and AC carefully read the author feedback, in addition to all the reviews, during the discussion period. Although the original reviews diverged somewhat, the discussion helped the reviewers' views converge. The proposed method combines BlockGAN with ideas from graph neural networks. Considering the difficulty of the problem, the results are quite impressive in generating scenes and videos involving multiple interacting objects. Nevertheless, the reviewers raised a number of concerns, including the issue of physical plausibility and the partial independence assumption. The experiments also leave room for improvement: it is not clear whether the proposed method still works well for more realistic scenes involving more complex and variable interactions among the objects. Three of the four reviewers changed their overall scores during the discussion, and all four reviewers now support accepting this paper. Consequently, the AC recommends accepting the paper for poster presentation. The reviewers have made a number of suggestions to improve the paper, and the authors are encouraged to consider them seriously in the revision.