Review for NeurIPS paper: Noise2Same: Optimizing A Self-Supervised Bound for Image Denoising

NeurIPS 2020

Review 1

Summary and Contributions: The paper argues that a denoising model trained with mask-based blind-spot approaches is not strictly J-invariant, and that minimizing the self-supervised loss over J-invariant functions is not optimal for self-supervised denoising. A regularization term that serves as a relaxed measure of a model's J-invariance is derived from a self-supervised upper bound. This term is then combined with a non-masked self-supervised reconstruction loss to train the model. The method avoids learning the identity mapping while keeping the center pixel value accessible during self-supervised reconstruction training. This distinguishes it from existing methods, which achieve strict J-invariance during training through blind-spot schemes or specific network designs. The proposed method is also slightly more efficient than existing blind-spot-based approaches.
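
As a rough illustration of the loss described above, the following is a minimal sketch and not the authors' implementation. It assumes the invariance term is the root-mean-square difference, over the masked set J, between the model's output on the full input and its output on the masked input, weighted by a hyperparameter. The names `combined_loss` and `lambda_inv`, the default weight, and the Gaussian replacement of masked pixels are all assumptions made for this example.

```python
import numpy as np

def combined_loss(f, noisy, mask, lambda_inv=2.0, rng=None):
    """Non-masked reconstruction term plus a relaxed invariance penalty.

    f          : callable mapping an image to an image (hypothetical denoiser)
    noisy      : 2-D array, the noisy input
    mask       : boolean array marking the pixel subset J
    lambda_inv : weight of the invariance term (assumed hyperparameter)
    """
    rng = np.random.default_rng(0) if rng is None else rng
    # Reconstruction: f sees every pixel, including the one it is denoising.
    out_full = f(noisy)
    rec = np.mean((out_full - noisy) ** 2)
    # Masked input: pixels in J are replaced by random values (an assumption).
    masked_input = np.where(mask, rng.normal(size=noisy.shape), noisy)
    out_masked = f(masked_input)
    # Invariance: how much the output on J changes when J is hidden.
    inv = np.sqrt(np.mean((out_full[mask] - out_masked[mask]) ** 2))
    return rec + lambda_inv * inv, rec, inv

# Example: the identity mapping reconstructs the noisy image perfectly,
# yet its output on J depends entirely on the pixels in J.
rng = np.random.default_rng(1)
noisy = rng.normal(size=(16, 16))
mask = np.zeros((16, 16), dtype=bool)
mask[::4, ::4] = True
total_id, rec_id, inv_id = combined_loss(lambda x: x, noisy, mask)
total_const, rec_const, inv_const = combined_loss(
    lambda x: np.zeros_like(x), noisy, mask
)
```

With the identity mapping the reconstruction term is zero but the invariance term is not, which is the mechanism by which a loss of this shape can discourage the identity solution without hiding the center pixel from the model.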

Strengths: 1. The proposed J-invariance regularization term implicitly prevents the model from learning the identity mapping while keeping the center pixel value accessible during self-supervised reconstruction training, which distinguishes it from existing approaches. 2. A theoretical analysis of the proposed regularization term is provided. 3. A simple yet intuitive self-supervised loss is proposed. The center pixel value is very useful for denoising. Blind-spot methods such as N2V cannot access the center pixel value, so the way they learn to exploit the center pixel for denoising is not optimal. Network architectures with strict J-invariance cannot access the center pixel value during either training or test, which is not optimal either and requires some post-processing. This paper provides a nice way to learn the dependency on the center pixel value for denoising, in which the center pixel value is directly accessible.

Weaknesses: After double-checking the paper following the rebuttal, I have the following concerns. 1. The major concern is the correctness of the experimental comparison. I would like to share the results of the two most closely related blind-spot-based methods, N2S [1] and Laine et al. [12], in their original implementations. These results tell a quite different story from what is shown in the paper, which used versions with the authors' own modifications. The original implementations of both N2S [1] and Laine et al. [12] achieve much better results on BSD68, by more than 1.1 dB PSNR, than those reported in the paper. Concretely, we use models trained with the original authors' code and training scheme on BSD400, the same training dataset used in this paper. The test results on BSD68 are as follows: (1) The original N2S achieves a PSNR of 28.12 dB vs. 26.98 dB reported in this paper with the authors' modifications. (2) The original Laine et al. [12] achieves a PSNR of 28.84 dB vs. 27.15 dB reported in the paper with the authors' modifications. In short, the original implementations of N2S and [12] noticeably outperform the proposed method. There is a big gap between the results of the original implementations and those reported in the paper. From the supplementary materials, it seems that the authors modified the original implementations (I am not sure, as the description is not very clear). To be honest, I do not see a good reason to include only the results from the modified implementations, which are much lower than the results from the original implementations. *********************************************************************** 2. The claim in Section 3.1 is rather confusing. It says that "the denoising function f trained through mask-based blind-spot approaches is not strictly J-invariant, making Equation (2) not valid".
Also, the rebuttal says that "it is stated that the results in Table 1 indicate that the model f does not have the J-invariance, thus violating the assumption behind using the loss in Eqn. (3)." In N2S, the denoising function f is J-invariant during training if we view f as the function equipped with masking. Thus, it does not violate Equation (2). Further, since Equation (2) is a training loss that is not used at test time, relating the J-invariance of a trained model at test time to the conditions of Equations (2) and (3) for training does not make sense.
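
To make the point about masking concrete, here is a minimal sketch under stated assumptions: `mean_filter` is a hypothetical stand-in for a trained network, and masked pixels are replaced by a fixed fill value rather than by whatever replacement scheme N2S actually uses. It checks whether the output on J changes when only the pixels in J are perturbed.

```python
import numpy as np

def mean_filter(img):
    """Hypothetical stand-in for a trained network: 3x3 mean filter."""
    padded = np.pad(img, 1, mode="edge")
    h, w = img.shape
    out = np.zeros(img.shape, dtype=float)
    for dy in range(3):
        for dx in range(3):
            out += padded[dy:dy + h, dx:dx + w]
    return out / 9.0

def with_masking(f, mask, fill=0.0):
    """Wrap f so that pixels in J (mask == True) are hidden before f runs."""
    def g(img):
        return f(np.where(mask, fill, img))
    return g

def depends_on_J(func, img, mask, rng):
    """Perturb only the pixels in J and report whether the output on J moves."""
    perturbed = np.where(mask, img + rng.normal(size=img.shape), img)
    return not np.allclose(func(img)[mask], func(perturbed)[mask])

rng = np.random.default_rng(0)
img = rng.normal(size=(8, 8))
mask = np.zeros((8, 8), dtype=bool)
mask[2, 3] = True  # J is a single pixel in this example

# The bare function uses the center pixel, so its output on J changes.
plain_dependent = depends_on_J(mean_filter, img, mask, rng)
# The function equipped with masking never sees J, so its output on J is fixed.
masked_dependent = depends_on_J(with_masking(mean_filter, mask), img, mask, rng)
```

In this toy setting the wrapped function is J-invariant by construction, while the same function applied without masking is not, which mirrors the distinction between training and test drawn above.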

Correctness: As described in the Weaknesses, the claim in Section 3.1 is somewhat misleading. The paper tries to argue for the importance of the center pixel value in denoising, but it confuses the concepts of J-invariance and dependency between training and test.

Clarity: The writing is fine to me. I suggest that the details of model training, testing, and results for all compared methods, which are important for reproducibility, be included in the main paper.

Relation to Prior Work: Adequate.

Reproducibility: Yes

Additional Feedback: I lowered my score due to the issues in the experimental comparison; please see the Weaknesses. The proposed method can be easily adapted to any existing self-supervised denoising method. It is not clear why the authors obtained the results of the other methods by modifying their original implementations, instead of directly applying the proposed method in the same setting as these methods. In addition, there are issues with the claim in Section 3.1, which need to be addressed.
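
For reference, the size of the gap in the experimental comparison follows directly from the BSD68 numbers quoted in the Weaknesses. The snippet below is plain arithmetic on those four figures, not a new measurement; the dictionary names are chosen for this example only.

```python
# PSNR on BSD68 in dB, as quoted in the Weaknesses of this review.
original = {"N2S": 28.12, "Laine et al. [12]": 28.84}
reported = {"N2S": 26.98, "Laine et al. [12]": 27.15}

# Difference between the original implementation and the paper's number.
gaps = {name: round(original[name] - reported[name], 2) for name in original}

for name, gap in gaps.items():
    print(f"{name}: {gap:.2f} dB")
```

This gives a gap of about 1.14 dB for N2S and about 1.69 dB for Laine et al. [12].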

Review 2

Summary and Contributions: After Rebuttal Summary: During the review phase, I did not realize that the PSNR scores of Laine et al. and N2S reported in the paper are lower than those reported in the original papers. I thank R1 for pointing this out. Given this new observation, I'm afraid that the conclusions in Section 3.2, and hence the motivation of the paper, no longer hold. However, I still think that characterizing the weak dependencies that networks trained with masking have on the center pixel, and a new objective function that learns these dependencies systematically, are interesting. The paper can be accepted if the inaccuracies in the results are fixed. I've updated my score to a weak accept. =========== The authors propose a new framework to train regular (non-J-invariant) neural networks for image denoising without ground truth. They do this via a new objective function that combines the MSE between the network output and the noisy image with another term which enforces that the network doesn't use the noisy pixel it is denoising. This loss function enables one to leverage the value of the noisy pixel being denoised without collapsing the denoising function into the identity.

Strengths: The proposed objective function is very straightforward and intuitive. The authors combine the MSE between the network output and the noisy image with the MSE between the network output on the masked noisy image and, again, the noisy image as target. The authors support this cost function by mathematically deriving that it is an upper bound of the supervised loss (Eq. 6). The proposed cost function works out of the box for different noise types, even when the noise distribution is unknown. To the best of my knowledge, the proposed work is novel. The work is relevant to the NeurIPS community and of practical importance for denoising.

Weaknesses: The proposed method is computationally expensive at training time, but this drawback is shared by all masking-based methods. Further, knowledge of the noise model could provide a useful prior, and I currently see no way of incorporating it into this framework. Overall, in my opinion, the work doesn't have any significant limitations compared to existing self-supervised methods.

Correctness: Yes

Clarity: Yes, I enjoyed reading it.

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback:

Review 3

Summary and Contributions: Based on the observation that J-invariance, which is required by most self-supervised denoising networks, leads to sub-optimal results, this paper proposes a novel loss for self-supervised image denoising without this constraint.

Strengths: 1. The paper is well written and the logic is easy to follow. 2. The authors observe that J-invariance leads to sub-optimal results and propose a new loss to tackle it. 3. The proposed method does not require knowledge of the noise model.

Weaknesses: My only concern about this paper is the experimental results. Apart from the PSNR, I do not see a big visual difference relative to existing self-supervised methods. I recommend that the authors provide comparisons on perceptual quality metrics, e.g., NIQE and BRISQUE. In addition, I think the authors need to provide comparisons on real noisy datasets, e.g., SIDD, NAM, and DND, since the noise model for real noisy images is unknown and such data should be better suited to a self-supervised framework. Minor issue: Why is the result of BM3D for BSD68 bolded in Table 3?
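
As context for why PSNR alone may not capture visual differences, recall that it is a purely pixel-wise function of the mean squared error. The following is a minimal sketch of the standard definition; the function name `psnr` and the default peak value of 255 (8-bit images) are assumptions for this example, and the paper's exact evaluation protocol may differ.

```python
import numpy as np

def psnr(reference, estimate, peak=255.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0:
        # Identical images: PSNR is unbounded.
        return float("inf")
    return 10.0 * np.log10(peak ** 2 / mse)

# Example on images scaled to [0, 1]: a uniform error of 0.1 per pixel.
clean = np.zeros((4, 4))
denoised = np.full((4, 4), 0.1)
value = psnr(clean, denoised, peak=1.0)
```

Because the score depends only on the average squared error, two outputs with very different visual artifacts can receive similar PSNR values, which is the motivation for the additional perceptual metrics requested above.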

Correctness: yes

Clarity: yes

Relation to Prior Work: yes

Reproducibility: Yes

Additional Feedback: As pointed out by other reviewers, there are misstatements in the experiments: the PSNR of N2S reported in this paper is lower than in the original N2S paper. The authors need to address this issue.

Review 4

Summary and Contributions: This paper proposes a new method called Noise2Same for self-supervised image denoising. The method optimizes an upper bound of the typical supervised loss, which consists of a reconstruction loss and an invariance loss. Experimental results show the effectiveness of the proposed method on synthetic noisy images.

Strengths: 1. The paper is well organized and easy to follow. The idea of using an upper bound for self-supervised image denoising is very interesting. 2. There is extensive analysis and explanation of existing methods and the proposed method, which makes the paper more convincing. 3. The method achieves consistently better results than existing self-supervised methods in the experiments.

Weaknesses: 1. The idea of the paper is interesting but seems to be over-claimed. For example, the abstract says that existing methods may be sub-optimal, but this is also true of the proposed method. The proposed method is derived from "an" upper bound of the typical supervised loss, which has not been proven to be the tightest bound. This is somewhat misleading. In addition, the analysis in Sec. 4.2 does not show why the proposed method is better than the baselines; most conclusions are obtained empirically. 2. The results are on synthetic noise. It would be great to apply the proposed method to real noisy image benchmarks.

Correctness: Seems correct. Did not check the proof.

Clarity: Satisfactory.

Relation to Prior Work: yes

Reproducibility: Yes

Additional Feedback: Update after rebuttal: I lowered my score because another reviewer pointed out the inconsistency in the baseline results. The results were misleading for readers.