All reviewers recommend acceptance (to varying degrees) after reviewing the author response. The submission focuses on weakly-supervised vision-language grounding and proposes a novel counterfactual contrastive learning objective. Some initial weaknesses regarding the comparison with hard-negative-style approaches have been addressed in the rebuttal. I encourage the authors to include these results and the other suggested revisions in future versions.