NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID:32
Title:Unsupervised Scale-consistent Depth and Ego-motion Learning from Monocular Video

Reviewer 1

Overview: The paper proposes a geometric constraint that allows ego-motion and depth networks to be trained on monocular videos. While the resulting approach is relatively simple, it is very efficient and achieves state-of-the-art or near state-of-the-art results. The authors provide extensive ablation studies and simulations to complement their work.

Quality: The quality of this contribution is generally high. Most notably, the authors provide extensive ablation studies and use simulations to elucidate the properties of the proposed method.

Clarity: The paper is clearly written and easy to read.

Originality: The proposed method is certainly novel and the contribution is original enough.

Impact: While other alternatives exist that may achieve similar performance, the proposed method is simple, efficient, and does not require expensive-to-obtain datasets. As such, it may prove very useful in practical applications.

Conclusion: This paper tackles an important problem and proposes an original solution that may be useful in practical settings. The quality of the experiments is high. Overall, this paper is above the acceptance threshold.

In their response, the authors provided additional details and clarified what they meant by efficiency (mainly training time efficiency). In the discussion, R3 raised a particular concern: while efficiency is very important for this contribution, it receives relatively little attention in the article. Although this is partially addressed in the rebuttal, I believe it would be beneficial to elaborate further on how much training time previous methods require, as this is not entirely clear at the moment and requires re-reading specific parts of previous SOTA works. In particular, it is important to clarify the phrase "iterative training" that the authors used in the response, i.e. it would help to delineate what every iteration consists of, how many iterations are usually required, how much time each iteration takes, etc. Overall, I agree that these concerns are important, but I believe they could be largely addressed without additional experimentation, by slightly rewriting and expanding the relevant parts of the paper. I believe that the authors will be able to do so before the final submission. Moreover, while important, I don't think these concerns are critical, as they are mostly about how the results are presented, not about the results themselves. Therefore, I keep my score unchanged.

Reviewer 2

The originality of the method stems from the combination of a number of known, reasonable techniques. Related work is cited well and is clearly differentiated from the proposed method. The visual odometry results seem highly encouraging, both qualitatively and quantitatively, but I am not familiar enough with the related work to be certain. It is not clear how good the KITTI depth results are. For the non-pretrained setting, the proposed method is only marginally better than the cited CC method [10], which, as a joint learning method, incorporates weaker priors than the strong priors required by the method proposed by the authors. The paper is written very clearly, and because the proposed method combines rather simple ingredients it is easy to follow. Overall I think this is a fair submission; it is good to have empirical data showing that the proposed methods (in particular the geometry consistency) do in fact help. The authors did not significantly exceed state-of-the-art results for depth estimation except in special cases. However, for future research, the proposed method can plausibly be combined with other methods to move the state of the art forward, and quantitatively it already provides an improvement for visual odometry tasks.

UPDATE: After reading the authors' rebuttal, I have chosen to maintain my score, for the following reasons: (1) the authors address the concern about the strength of the priors, and this is a helpful component of the rebuttal; (2) the authors helpfully explain the training time improvements; as other reviewers have noted, this should be made clearer in the paper, which should discuss both training and test time differences between the proposed method and prior work. However, (3) the remaining main flaw of the paper is that the contributions are theoretical and efficiency-related in nature; it would be more convincing to see the impact of such contributions as an increase in final accuracy. For example, this could be shown with a comparison of accuracy between the proposed and other methods after similar amounts of training time/compute.
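To make the suggested comparison at matched training budgets concrete, here is a minimal sketch of how it could be tabulated from checkpoint logs. The function names (`best_error_within_budget`, `compare_at_budgets`), the method labels, and all numbers are hypothetical and invented purely for illustration; they are not results from the paper or from any prior work.

```python
from bisect import bisect_right


def best_error_within_budget(log, budget_hours):
    """Return the lowest error reached at or before budget_hours, or None.

    log: list of (elapsed_hours, error) checkpoints sorted by elapsed_hours.
    """
    times = [t for t, _ in log]
    n = bisect_right(times, budget_hours)  # number of checkpoints with t <= budget
    if n == 0:
        return None  # no checkpoint was produced within this budget
    return min(err for _, err in log[:n])


def compare_at_budgets(logs, budgets):
    """Map each budget to {method name: best error within that budget}."""
    return {
        b: {name: best_error_within_budget(log, b) for name, log in logs.items()}
        for b in budgets
    }


# Hypothetical checkpoint logs (elapsed hours, error metric); illustration only.
logs = {
    "method_a": [(5, 0.20), (10, 0.16), (20, 0.14), (40, 0.13)],
    "method_b": [(5, 0.18), (10, 0.15), (20, 0.145)],
}

if __name__ == "__main__":
    for budget, row in compare_at_budgets(logs, [5, 10, 20, 40]).items():
        print(budget, row)
```

Reporting the best error within each budget, rather than only the final error, would show whether an efficiency gain translates into better accuracy at equal compute.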

Reviewer 3

The paper describes an elegant methodology to improve the performance of the very important and difficult task of depth and ego-motion estimation from monocular video. Experimental results validate the efficacy of the proposed methods. The paper is well organized and well written. The paper would benefit from a more thorough study of the proposed methods: a mask for the photometric loss is proposed as a more efficient and simpler alternative to estimating optical flow. A natural question, which is not addressed, is how many milliseconds of GPU time this saves.
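The GPU-time question above could be answered with a small timing harness along the following lines. This is only a sketch: `median_ms` and its parameters are hypothetical names, and the workload is a CPU placeholder standing in for the masked-loss and optical-flow code paths, which are not reproduced here. For asynchronous GPU work, a device synchronization call (for example `torch.cuda.synchronize`, assuming a PyTorch implementation) would need to be passed as `sync` so that the timer covers the full kernel execution.

```python
import statistics
import time


def median_ms(fn, repeats=20, warmup=3, sync=None):
    """Median wall-clock time of fn() in milliseconds over `repeats` runs.

    sync: optional callable invoked after each run, e.g. a device
    synchronization function when fn launches asynchronous GPU work.
    """
    for _ in range(warmup):  # warm-up runs are executed but not timed
        fn()
    if sync is not None:
        sync()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def placeholder_workload():
    # Stand-in computation; replace with the code path to be measured.
    return sum(i * i for i in range(10_000))


if __name__ == "__main__":
    print(f"{median_ms(placeholder_workload):.3f} ms per call")
```

Timing both alternatives with the same harness and reporting the difference per training step would give the figure requested.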