Sun, Dec 8 through Sat, Dec 14, 2019, at the Vancouver Convention Center
All reviewers converged to acceptance scores for this submission, and the AC agrees with this consensus. However, in preparing the final version for publication, I would recommend a couple of modifications.

First, the theoretical analysis in Section 3.2 relating to gradient priority seems distracting and unnecessary. It is fairly obvious that the aggregate gradient magnitude contributed by the inliers or the outliers grows in proportion to their number of samples, and it is not clear what additional benefit the derivations and heuristic approximations presented here provide (on this point I completely agree with Reviewer 1). Moreover, I do not see how any of this differentiates the proposed method from standard approaches based on AEs: they would experience a similar gradient priority whenever the inlier proportion exceeds that of the outliers. Condensing Section 3.2 would free space for moving important modeling details out of the supplementary material or for adding experiments.

Second, the proposed model relies on a 10-layer wide residual network for comparative testing, while the alternative AE baselines are built on simpler 4-layer encoder/decoder pairs with no residual connections. Is this really a fair benchmark? How can we tell what benefit comes from the proposed surrogate supervision and what is merely the result of different capacity or network structure? Further testing with richer AE architectures would be more convincing.
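The claim that a group's aggregate gradient scales with its sample count can be illustrated with a minimal sketch. The one-parameter reconstruction model and all names below are my own simplification, not taken from the submission: with a summed squared reconstruction error, the total gradient contributed by the inliers is simply the sum of their per-sample gradients, so it grows linearly in their count.

```python
import numpy as np

# Toy "autoencoder" with one parameter: x_hat = w * x (my own simplification,
# not the paper's model). Summed loss L = sum_i (w*x_i - x_i)^2, so the
# group gradient is dL/dw = sum_i 2*(w*x_i - x_i)*x_i, which is additive
# over samples and therefore proportional to the group's size.
def group_gradient(w, x):
    """Summed gradient of sum_i (w*x_i - x_i)^2 with respect to w."""
    return np.sum(2.0 * (w * x - x) * x)

rng = np.random.default_rng(0)
w = 0.5
inliers = rng.normal(1.0, 0.1, size=900)   # 90% of the batch
outliers = rng.normal(5.0, 0.1, size=100)  # 10% of the batch

g_in = group_gradient(w, inliers)
g_out = group_gradient(w, outliers)

# Doubling the inlier count exactly doubles their gradient contribution.
g_in_doubled = group_gradient(w, np.concatenate([inliers, inliers]))
print(g_in_doubled / g_in)  # → 2.0
```

Note this argument only uses additivity of the gradient over samples, which holds for any model trained with a per-sample reconstruction loss; that is why a standard AE would exhibit the same gradient priority.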