This paper proposes a method for hierarchical, long-horizon planning: it finds good intermediate points between the start and goal states, then recursively applies the same subdivision to the resulting sub-plans, decomposing the overall plan into smaller pieces. Planning is goal-conditioned and operates on image observations, predicting a path given only images of the initial and final states. The algorithm outperforms alternative methods based on video interpolation on scenarios involving both image-based navigation and robotic pick-and-place. Reviewers were initially divided on the merits of this paper and on its contribution relative to past work such as Hindsight Experience Replay (HER) and Search on the Replay Buffer (SORB), with most concerns centered on the latter. After the rebuttal, they converged on the view that the idea of imagining intermediate goals is useful, and the rebuttal clarified the concerns about overlap with prior work. I think the score of 9 is too generous, given that similar ideas have appeared a number of times in the hierarchical generative models literature, and even in the video prediction literature (e.g., https://zswang666.github.io/P2PVG-Project-Page/, which is not cited but should be). That said, the approach is very promising, and I recommend it be accepted as a poster at the conference.
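
For concreteness, the recursive subdivision idea can be sketched as follows. This is a minimal illustration, not the paper's method: `predict_subgoal` stands in for the learned, image-conditioned subgoal predictor, and here it is replaced by a simple midpoint heuristic on low-dimensional states.

```python
# Illustrative sketch of hierarchical plan subdivision. The function names and
# the midpoint heuristic are assumptions for illustration; the paper's model
# predicts intermediate goals from image observations instead.

def predict_subgoal(start, goal):
    # Placeholder for a learned subgoal predictor: take the arithmetic
    # midpoint of two low-dimensional states.
    return tuple((s + g) / 2 for s, g in zip(start, goal))

def subdivide_plan(start, goal, depth):
    # Recursively insert a predicted intermediate goal between start and
    # goal, halving the planning horizon at each level of the hierarchy.
    if depth == 0:
        return [start, goal]
    mid = predict_subgoal(start, goal)
    left = subdivide_plan(start, mid, depth - 1)
    right = subdivide_plan(mid, goal, depth - 1)
    return left + right[1:]  # drop the duplicated midpoint

plan = subdivide_plan((0.0, 0.0), (8.0, 0.0), depth=3)
print(plan)  # 2**3 + 1 = 9 waypoints from (0, 0) to (8, 0)
```

Each level of recursion doubles the number of plan segments, so depth d yields 2^d + 1 waypoints; the paper's contribution is learning to predict these intermediate points directly from start and goal images.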