Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
The paper presents a neural network model for image registration that generates an arbitrary displacement field to transform the input image so that it matches the target. The network has several components, including a shared feature extraction module that produces a 4D tensor containing the correlations of local features from both images. This tensor is then transformed into a vector representation of the transformation, which is later used to reconstruct a displacement field.

COMMENTS

Overall, the work is relatively well presented and provides enough detail to understand most of the formulation and solution. However, some confusing aspects could be clarified or stated more prominently.

* My understanding is that the components described in Sections 3.2 and 3.3 are the central contribution of this work. Section 3.1 describes a strategy used before by other researchers, as are the loss functions, which seem to be standard and adapted for this work. Is this correct?
* I found it difficult to understand the motivations behind these two components. While it seems reasonable to use them and the design looks coherent, not much discussion is provided about why the authors believe this is the right way to model the architecture.
* The number of 4D conv-layers is not mentioned, so apparently there is only one layer. How critical is the 4D convolution in this architecture?
* The geometry of the filters (Sec. 4.1) does not match 4 dimensions: I assume a tensor with dimensions (w,h,w,h), while the size of the kernels is (3,3,3) with channels (10,10,1). Can you clarify?
* I understand another interesting component of the proposed network is the displacement field predictor, which replicates the transformation vector n times, with n the number of 2D points in the displacement field. I could not completely follow the continuity argument, nor the smoothness-vs-complexity discussion.
The authors say that they prove that spatial continuity is guaranteed, but the explanations provided do not seem to be a sufficient proof to me, unless I missed something.
* The experimental results seem generally coherent, but the terminology does not quite match the conventions used in the solution. More specifically, parametric and non-parametric transformations are terms that were not introduced in the formulation, and it is difficult for the reader to follow exactly what the authors mean.
* The writing style needs polishing. In general, the ideas are well organized, but the text needs grammar improvements throughout. The current version is distracting and makes it difficult to follow some details.

In summary, the paper has interesting ideas, but it still needs to improve the quality of presentation significantly to be a robust submission.
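For concreteness, the correlation step described in the summary (pairwise comparison of local features from both images into a 4D tensor) can be sketched as follows; the function and variable names are my own, not the paper's:

```python
import numpy as np

def correlation_tensor(feat_a, feat_b, eps=1e-8):
    """feat_a, feat_b: (C, H, W) feature maps from a shared encoder.
    Returns a 4D tensor c of shape (H, W, H, W) where
    c[i, j, k, l] = <feat_a[:, i, j], feat_b[:, k, l]>
    after L2-normalizing each local feature vector."""
    C, H, W = feat_a.shape
    a = feat_a.reshape(C, H * W)
    b = feat_b.reshape(C, H * W)
    a = a / (np.linalg.norm(a, axis=0, keepdims=True) + eps)
    b = b / (np.linalg.norm(b, axis=0, keepdims=True) + eps)
    c = a.T @ b                      # (H*W, H*W) cosine similarities
    return c.reshape(H, W, H, W)
```

Spelling out the tensor this way also makes my filter-geometry question concrete: a (w,h,w,h) tensor has four spatial axes, so it is unclear how (3,3,3) kernels apply to it.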
Originality

I cannot comment much on the originality of this work because I am not familiar with the related works in this research field. To my best understanding, the image encoder (Sections 3.1 and 3.2, up until right before the global descriptor MLP) is proposed by . Putting aside the image encoder, which can easily be swapped for any better one, I believe the arbitrary continuous displacement field predictor, together with the smoothness proof, may be a significant contribution: the proposed parameterization of the deformation field, free from any geometric constraint, performs better than the previous works with geometric constraints. The baseline methods in the experiments are modified to use the same image encoder for a fair comparison, which strengthens the contribution of this paper.

Quality

I believe the proposed method is sound and well motivated. This is a self-contained paper with an interesting core idea and clear desirable properties. The results are convincing, and the baselines appear to be relatively recent works with improvements made for fair comparison. Yet, inconsistency in the baselines makes the performance comparison difficult. I believe it would be helpful to report the CNNGeo-4D result in experiment 4.3 and to use at least the best-performing baseline in experiments 4.4 and 4.5.

Clarity

The paper is well written and easy to follow. It clearly motivates the problem and explains the details of the network to a degree that makes it reproducible. The experiment protocol and the dataset are clearly described and convincing.

Minor comment: typo "dose" => "does", line 148.

Significance

Although I am not familiar with this field, I find this paper interesting. For previous works, it makes perfect sense to apply a pre-defined parameterization of a known geometric transformation to model deformation, or to apply an additional constraint for smoothness, based on traditional computer vision and graphics knowledge.
However, the method this paper proposes gives me an interesting insight: we can instead train a single continuous function to output a smooth and continuous displacement field without any geometric assumption. From my 3D vision background, this lesson aligns with the recent trend of parameterizing 3D shape as a single continuous function, as in DeepSDF or OccupancyNet. The paper seems to make a fair comparison against recent related works and clearly demonstrates the strength of the proposed method.
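To illustrate the idea I find interesting, the displacement field predictor (the global transformation vector tiled across all query points, then decoded per point by an MLP) can be sketched roughly as below; the layer sizes, activation, and names are my assumptions, not the paper's exact design:

```python
import numpy as np

def displacement_field(points, z, W1, b1, W2, b2):
    """points: (n, 2) query coordinates; z: (d,) global transformation vector.
    The vector z is replicated for each of the n points, concatenated with
    each coordinate, and passed through a small MLP -> (n, 2) displacements.
    W1, b1, W2, b2 stand in for the trained MLP parameters."""
    n = points.shape[0]
    inp = np.concatenate([points, np.tile(z, (n, 1))], axis=1)  # (n, 2 + d)
    h = np.tanh(inp @ W1 + b1)    # smooth activation -> continuous in (x, y)
    return h @ W2 + b2            # (n, 2) predicted (dx, dy)
```

Because the MLP with a smooth activation is a continuous function of the query coordinate, nearby points get nearby displacements; this is the continuity property the paper relies on, stated here informally rather than as a proof.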
This paper takes a reasonable approach to the task, and the results appear competitive with state-of-the-art methods. The authors must clarify which parts of their approach constitute novel contributions as opposed to existing technology. In particular, Section 3.1 appears to contain the same content as Section 3.1 in Neighborhood Consensus Networks by Rocco et al.; Equation (1) is copied verbatim. While this work is cited as  in other sections of the submission, it is **not** mentioned in Section 3.1. A citation as well as a discussion must be added. In addition, I would like the authors to clarify this point and make their contributions clear.

The proof that the learned displacement field is smooth requires some clarification. While it defers to the universal approximation theorem and regularization techniques, these claims do not constitute a proof. Additional discussion and empirical results would be necessary to support the smoothness claim.

The authors provide only a few qualitative results and do not discuss failure cases. Both would be necessary to better evaluate the work.

Minor comments:
13: "is prov*en*"
112: "further *amplify*"
118, 129: commas should be on the same line as the equations
120-121: some more intuition on the normalization of the correlation tensor would be helpful
147: "trivially prove*n*"
148: "dose" --> "does"
217: "correspondence*s*"
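As one concrete example of the empirical evidence I have in mind for the smoothness claim, a simple finite-difference statistic over a predicted dense field could be reported; this is my own sketch, not the authors' code:

```python
import numpy as np

def smoothness_metric(field):
    """field: (H, W, 2) dense displacement field sampled on a grid.
    Returns the maximum finite-difference gradient magnitude across
    both spatial axes -- a crude empirical smoothness score
    (smaller = smoother)."""
    dy = np.diff(field, axis=0)      # (H-1, W, 2) vertical differences
    dx = np.diff(field, axis=1)      # (H, W-1, 2) horizontal differences
    return max(np.abs(dy).max(), np.abs(dx).max())
```

Reporting such a statistic for the proposed method versus the baselines, alongside the qualitative results, would make the smoothness argument far more convincing than the appeal to universal approximation alone.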