NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 7817 Defending Neural Backdoors via Generative Distribution Modeling

### Reviewer 1

Overall, I think this paper presents a valuable idea for backdoor defense. The basic idea is to recover the triggers; the main technical challenge it addresses is that different recovered triggers may lead to poor defense performance. Instead of recovering a single trigger, this work therefore models a distribution over triggers. The technical development in Sec 3 makes sense, and the evaluation in Sec 4 demonstrates that the approach is valid.

Several aspects can be improved. The presentation (esp. the notation) in Sec 3 is very dense and hard to follow. For example, $g$ is a function, but I'm not sure what $g'$ is or why it can be compared with a scalar $\omega$. Are $g$ and $g'$ just two linear transformations? The evaluation is done on CIFAR-10, while the triggers are all black/white. This may be a problem, since black-and-white triggers are clearly out of distribution. The target class is set to c=0, and the authors argue that all classes are equivalent. But they are not: it is natural to expect that some classes are easier to attack than others, as is the case in studies of adversarial examples.

Nevertheless, I think this is an OK paper, and the results are worth publishing.

### Reviewer 2

-- The proposed method is novel and has some nice properties. It recovers the distribution of triggers, thereby bringing better robustness to the network.

-- The experiments are comprehensive and demonstrate the effectiveness and robustness of the model.

-- This work is well written, with barely any grammatical mistakes.

-- Figure 1 can be reorganized for better clarity.

-- I do not see major weaknesses in the work. Experiments on more datasets such as CIFAR-100 and Tiny ImageNet would be a plus.

### Reviewer 3

Originality: This paper contains several points of novelty: the discovery of the continuous nature of trigger distributions; a new generative procedure, MESA, that learns an unknown distribution without samples; and the application of MESA to defense against backdoor triggers, which achieves significantly improved performance.

Quality: The proposed approach is principled and can effectively learn an unknown distribution without samples when given enough computational resources. The assumption that the testing function F is the composition of the density f and a monotonic function g is somewhat idealistic, but it seems to hold reasonably well for the task of backdoor trigger detection.

Clarity: This is the main area of weakness for this paper. The description of MESA (Section 3.1) is very messy and unstructured, and it took me several passes to understand the method. In particular, I could not fully appreciate why the ensemble was necessary until I realized that each G_i learns a level set of the true density function. To help the reader understand and appreciate the result, the authors should devote significant effort to explaining the method clearly, e.g. via a step-by-step breakdown of Figure 2(a).

Another minor question concerns the experimental setting. When the location at which to apply the trigger is sampled randomly, is the location fixed for all triggers, or is it also variable? In that case, does F need to apply the trigger at a randomly chosen location and repeat several times to compute the average success rate across locations? If not, I see this as a flaw in the evaluation setting, as it is unreasonable to assume that the defender knows the (random) location of the trigger.

Significance: The improvement that MESA makes over prior defenses is significant, as demonstrated by Figure 4. This work makes several important discoveries central to the testing-stage defense against backdoor trigger attacks. I can foresee an array of future work that proceeds in this direction.

One clear limitation of this method is that the support set of the distribution that MESA models must be small. In particular, the authors limit their evaluation to 51 patterns, as opposed to all 3x3x3 (CxHxW) tensors containing RGB values between 0 and 255. This is reasonable as a first step, but I am concerned that the approach is limited in principle by the size of the support set, in that the computation may scale linearly with the number of supported elements.

----------------------------------------------------------------------

I have read the author response and decided to maintain my evaluation.
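The location question raised in this review can be made concrete with a minimal sketch. It is not taken from the paper: it only illustrates one way a testing function could average attack success over randomly sampled trigger locations, so that the defender does not need to know where the trigger was placed. All names here (`apply_trigger`, `average_success_rate`, `toy_classify`, `n_locations`) are hypothetical, images are simplified to single-channel nested lists, and the classifier is a toy stand-in for a backdoored model.

```python
import random


def apply_trigger(image, trigger, top, left):
    """Return a copy of `image` with `trigger` pasted at (top, left).

    Both arguments are 2-D nested lists (single channel, for simplicity).
    """
    patched = [row[:] for row in image]  # copy so the input is not mutated
    for i, trigger_row in enumerate(trigger):
        for j, value in enumerate(trigger_row):
            patched[top + i][left + j] = value
    return patched


def average_success_rate(classify, images, trigger, target_class,
                         n_locations=8, seed=0):
    """Estimate attack success averaged over random trigger locations.

    `classify` is any callable mapping an image to a class label
    (an assumption: a stand-in for the backdoored model).
    """
    rng = random.Random(seed)
    height, width = len(images[0]), len(images[0][0])
    t_height, t_width = len(trigger), len(trigger[0])
    hits, total = 0, 0
    for image in images:
        for _ in range(n_locations):
            # randint is inclusive, so the patch always fits in the image.
            top = rng.randint(0, height - t_height)
            left = rng.randint(0, width - t_width)
            patched = apply_trigger(image, trigger, top, left)
            hits += int(classify(patched) == target_class)
            total += 1
    return hits / total


def toy_classify(image):
    """Toy 'backdoored' model: predicts class 0 if any pixel equals 1."""
    return 0 if any(v == 1 for row in image for v in row) else 1
```

With this setup, a candidate trigger is scored by its success rate averaged over locations rather than at one known position; whether the paper's F does this is exactly what the review asks the authors to clarify.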

### Reviewer 4

(1) This paper is clearly written. Useful visualizations (e.g., Figures 3 and 5) are provided to aid understanding.

(2) This paper is well motivated: since it shows empirically that a distribution of backdoor triggers indeed exists, it makes perfect sense to apply a generative model to characterize this distribution and obtain a strong defense model.

(3) The proposed method is novel. Because the backdoor trigger distribution is unknown, traditional generative modeling methods such as GANs or VAEs, which need to sample directly from the distribution, are not applicable in this situation. The authors therefore propose a sampling-free algorithm (based on entropy maximization) to model the trigger distribution.

(4) Compared to the traditional reverse-engineering method, experimental results on CIFAR-10 demonstrate that the proposed method can improve and stabilize the robustness of defense models.