Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
The paper is clearly written, and everything proposed in it makes sense and seems like a natural thing to do (I have been working on the same problem, so I am entirely in favor of the pursued direction). Still, the paper is missing comparisons and references to other works that represent image or video segmentation in terms of a continuous embedding, e.g.:
- A. W. Harley et al., "Segmentation-Aware Convolutional Networks Using Local Attention Masks", ICCV 2017
- S. Chandra et al., "Dense and Low-Rank Gaussian CRFs Using Deep Embeddings", ICCV 2017
- S. Kong et al., "Recurrent Pixel Embedding for Instance Grouping", CVPR 2018
- H. Ci et al., "Video Object Segmentation by Learning Location-Sensitive Embeddings", ECCV 2018
- X. Wang et al., "Non-local Neural Networks", CVPR 2018
- Y. Chen, J. Pont-Tuset, A. Montes, and L. Van Gool, "Blazingly Fast Video Object Segmentation with Pixel-Wise Metric Learning", CVPR 2018
This is not only important for giving credit to earlier works. A more crucial question in connection with these works is whether the structured layer adds something on top of the ability of a CNN to compute embeddings for image segmentation. In experiments that I have been working on, it has been really hard to beat a well-tuned, plain convnet trained with a siamese loss, and introducing a spectral normalization layer only added complications. It would be really useful if the authors could do this comparison on top of a strong baseline (e.g. the methods mentioned above) and indicate whether the resulting embeddings (= eigenvectors) are any better than those delivered by the original baselines (the method of  is outdated by 4 years).
- The paper is well written and easy to follow.
- The proposed idea (combining deep learning and diffusion distance) is clean and well motivated.
- The proposed approach can potentially be applied to other graph applications beyond images, though no evidence of this is given.
- My big concern is with the evaluation of the experimental results. From my understanding (L172), the features are initialized with a network pre-trained on MS-COCO. This means that the ResNet-101 features already embed strong pixelwise semantic information (acquired with fully supervised training). Therefore, the comparison with other methods does not seem fair.
- I also believe that many implementation/training details are missing. E.g., what optimization algorithm is used? What are the hyper-parameters?
I am not expert enough to evaluate its originality and novelty for semantic segmentation, as I am more familiar with spectral analysis and diffusion. But my main concerns lie in the experiments. From the method part, I expect it to be a general segmentation approach, which should be evaluated on a standard semantic segmentation dataset. But the experiments severely lack such analysis. Even on the weakly supervised segmentation dataset, it is only compared with one baseline.