Good work analyzing the pros and cons of various object representations, and a neat way to combine them into a single framework that yields good gains on the COCO benchmark. The proposed solution of using a self-attention module to bridge the representations is original, simple, and widely applicable. I think the method and the work reveal intriguing differences between the various representations, and this will be useful to the community. The authors should adapt the camera-ready version in accordance with the post-rebuttal comments from the reviewers (especially as concerns more fine-grained statements about the contributions of this work).