The paper presents an interesting method for multimodal fusion based on exchanging feature channels across modalities. All reviewers recommend acceptance. The AC agrees with the consensus reached by the reviewers and requests that the authors improve the related work discussion, as pointed out by R1, and incorporate the discussion from the rebuttal into the final version of the paper.