This paper makes it possible to learn Lagrangian dynamics from images and use them for energy-based control. This is a significant advance for the fledgling research subfield of physics-aware prediction, which may well prove important in the coming years; I believe the reviewers are all in agreement on this point. However, by entering this new territory, the paper has also exposed itself to interest from a broader community of readers and NeurIPS attendees who are familiar with the progress in image-based *intuitive physics* modeling and control methods over the last 5 years or so (R2 and R4 point to some such approaches). Much of the difficulty in arriving at a reviewer consensus can be attributed to the paper's somewhat myopic positioning, which ignores this broader context, perhaps because the authors themselves are not familiar with these approaches. I would therefore urge the authors to situate their work within this broader image-based prediction and control literature, to help introduce not only their own work but this whole family of approaches to a wider audience. To maximize impact, this might even mean evaluating such methods as baselines (e.g., Ebert '18); note that you would not necessarily have to beat those methods, given your other advantages, but these comparisons are important to place this approach in its proper context. Even within the physics-aware prediction literature, some references, comparisons, and a clear statement of assumptions are missing, as is a clear statement of the claims, though the rebuttal has begun to address these. In particular, a comparison to Toth '20 (in zero-control settings) is important. Had these changes already been made, I would have been able to recommend a strong accept without many reservations.
At present, I can only tentatively recommend a poster presentation, and I urge the authors to incorporate these suggestions into their camera-ready version.