NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 2100
Title: Subspace Attack: Exploiting Promising Subspaces for Query-Efficient Black-box Attacks

Reviewer 1

- The authors propose a heuristic black-box attack algorithm that improves the query efficiency of black-box attacks by utilizing prior gradients from reference models. The assumption is that gradients from the reference models should also transfer to the target model and thus could help better estimate the true gradient.

- The paper does not come with any theoretical justification. That is fine for a heuristic method, yet I would expect some more direct evidence to support the key arguments here. The authors make an important assumption implicitly, i.e., that the gradients of the reference models transfer to the target model just like adversarial examples do (which are known to have some transferability between models). However, I did not find any direct evidence or supporting experiments to convince the reader that this is indeed true. For example, the authors could compare the cosine similarity between the reference models' gradients and the true gradients against the cosine similarity obtained with random Gaussian directions, and show that the reference models indeed provide more accurate gradient search directions.

- SUBSPACE ATTACK? The proposed method is called the subspace attack, I guess because, as the authors show in Figure 1, one can still achieve similar attack performance using m (less than n) basis vectors. That is nice to know, but I do not think it is related to the proposed method: in Figure 1 the basis vectors are fixed during the experiments, while the proposed method uses the reference models' gradients, which change at different points. The key idea here is using reference models' gradients as basis vectors, not using fewer than n basis vectors. From this point of view, I think Figure 1 and the name "subspace attack" are a bit misleading.

- Suppose the above-mentioned transfer assumption is indeed true. The proposed method works only if we have the reference models in the first place (in other words, only if we know the data distribution). It would be nice for the authors to consider the case where we do not know the data distribution, or at least comment on it.

Minor comments:
- Line 124, what does "adaptive" mean? I did not get it clearly.
- Section 4.1, why is it coordinate descent? The ZOO method [3] is in the coordinate-descent fashion, as it only changes one dimension at each step. The proposed method does not seem to be this case. Or do you just mean that it uses one search direction at each step?

=======================
I have read the authors' response and it clears most of my concerns, and hence I increase the score.
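The sanity check suggested above could be sketched roughly as follows. This is a toy illustration with synthetic vectors standing in for gradients (the dimension, noise level, and the additive-noise model of a "reference gradient" are all hypothetical, not taken from the paper): in high dimension, a partially transferable direction keeps a clearly nonzero cosine similarity with the true gradient, while random Gaussian directions concentrate near 1/sqrt(d).

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
d = 3072          # e.g. a CIFAR-10-sized input (3*32*32); illustrative only
trials = 200

true_grad = rng.standard_normal(d)   # toy stand-in for the victim's true gradient

# A "reference-model gradient" modeled as the true gradient plus noise,
# i.e. a partially transferable direction (hypothetical noise level of 2.0).
cos_ref = np.mean([cosine(true_grad, true_grad + 2.0 * rng.standard_normal(d))
                   for _ in range(trials)])

# A random Gaussian direction, as used by NES-style estimators.
cos_rand = np.mean([abs(cosine(true_grad, rng.standard_normal(d)))
                    for _ in range(trials)])

print(f"reference direction: {cos_ref:.3f}, random Gaussian: {cos_rand:.3f}")
```

On real models, one would replace the synthetic vectors with actual gradients from the reference and target networks; the experiment the review asks for is exactly this comparison.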

Reviewer 2

The idea of using reference models to guide query-based search is original and quite important in bridging the two lines of approaches to black-box attacks. The algorithm is explained clearly and its performance is evaluated with thorough experiments. The significance of the moving parts of the algorithm is also evaluated with carefully designed experiments.

However, I have one major concern with the current draft. Whenever we propose a method in adversarial ML, or any security domain in general, it is of utmost importance to first define the threat model clearly. The threat model typically includes a number of components:
1. What is the attacker's goal? Targeted attack or untargeted attack? How many points does he need to attack (important)?
2. What information/resources does the attacker have? Is he given a reference model for free, or does the reference model come at the cost of a certain number of queries, e.g. the number of training points used to train it?
3. What can the attacker do, i.e. what is the interaction protocol between the attacker and the victim learner? E.g. white-box or black-box.
4. What is the cost to the attacker? E.g. the number of queries.

Depending on the choice of threat model, the evaluation standard should differ. When comparing different algorithms, it is important to compare them under the same threat model, in which all attackers, though equipped with different attack methods, have access to the same resources. In this particular paper, it is not fair to compare the proposed method, which utilizes a valuable resource, the reference model, with methods that don't, e.g. NES and Bandits-TD. A fair comparison should take into consideration the potential cost of this reference model. For example, one can account for the cost of the reference model by charging a certain number of queries as overhead for its training points.

Then, depending on the total number of examples the attacker needs to attack (specified in the threat model), this overhead can be divided among all examples. In the CIFAR10 experiment, the attacker attacks 1000 points, with 2000 training points for the reference model, so the mean and median number of queries should increase by 2000/1000 = 2 to account for this overhead. There will then be a trade-off and a choice to be made: when the number of points to be attacked is small, the attacker may not want to spend a large number of queries to train a reference model, whereas when the number of target points is enormous, this is a good strategy to consider. This is the principled way of doing research on problems in the adversarial setting, and is highly recommended.
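The accounting proposed above amounts to a simple amortization. A minimal sketch, using the review's CIFAR10 numbers (the function name is mine, introduced only for illustration):

```python
def amortized_query_overhead(n_training_points, n_target_points):
    """Per-example query overhead if each training point of the
    reference model is charged as one query (the accounting above)."""
    return n_training_points / n_target_points

# Numbers from the CIFAR10 experiment discussed above:
# 2000 training points for the reference model, 1000 points attacked.
extra = amortized_query_overhead(2000, 1000)
print(extra)  # 2.0 extra queries charged to each attacked example
```

The trade-off follows directly: the overhead term shrinks as the number of target points grows, so training a reference model pays off only for large-scale attacks.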

Reviewer 3

The paper improves a previous score-based black-box attack, "Prior convictions: Black-box adversarial attacks with bandits and priors", by leveraging gradients from substitute models. In detail, the authors first train several reference models. When generating adversarial examples, the gradient from one randomly picked reference model is used to calculate the search direction. The gradient w.r.t. the input image is then approximated based on the calculated search directions. Finally, the adversarial examples are generated through iterative-FGSM/PGD.

The proposed method successfully improves the query efficiency and failure rate of the Bandits-TD method. The improvement comes from the gradients of the reference models. However, when training the reference model on ImageNet, 75,000 images from an auxiliary dataset and all the images from the ImageNet validation set are used. One concern is that, in reality, the adversary may not have access to such a large amount of auxiliary and validation images. Without those images, it would be hard to train a useful reference model, and thus the query efficiency of the black-box attack could not be improved. It seems that such efficiency comes not from the method but from extra information.

Originality: the idea is not brand new. Works with similar ideas are available online, such as "Improving Black-box Adversarial Attacks with a Transfer-based Prior".
Quality: the method is technically sound and the evaluation is fair, but it is not supported by theoretical analysis.
Clarity: the paper is clearly written and well-organized. Adequate information and code are provided for reproduction.
Significance: improving the query efficiency of score-based attacks is good, but it depends on the availability of extra information.
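The attack loop summarized in the first paragraph can be sketched as follows. This is a simplified toy version, not the paper's implementation: the "victim" is a linear score queried as a black box, and the "reference gradients" are synthetic stand-ins for gradients of trained substitute networks (the dimension, noise level, and hyperparameters are all hypothetical).

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64

# Toy black-box victim: a linear score the attacker can only query.
w = rng.standard_normal(d)
def victim_loss(x):
    return float(w @ x)  # the attacker wants to increase this loss

# Toy stand-ins for reference-model gradients (in the paper these come
# from trained substitute networks; here: true gradient plus fixed noise).
noises = rng.standard_normal((3, d))
refs = [lambda x, s=s: w + 0.5 * s for s in noises]

x = np.zeros(d)
eps, step, delta = 0.5, 0.05, 1e-3
for _ in range(30):
    v = refs[rng.integers(len(refs))](x)   # randomly pick one reference model
    v = v / np.linalg.norm(v)              # its gradient gives the search direction
    # Two queries estimate the directional derivative along v.
    g = (victim_loss(x + delta * v) - victim_loss(x - delta * v)) / (2 * delta)
    # Iterative-FGSM/PGD step using the estimated gradient g * v.
    x = np.clip(x + step * np.sign(g * v), -eps, eps)

print(victim_loss(x))  # loss grows above the starting value of 0.0
```

The reviewer's concern maps onto the `refs` list: if the auxiliary data needed to train useful reference models is unavailable, the search directions degrade toward random ones and the query savings disappear.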