Part of Advances in Neural Information Processing Systems 32 (NeurIPS 2019)
Ji Feng, Qi-Zhi Cai, Zhi-Hua Zhou
In this work, we consider a challenging training-time attack: modifying the training data with bounded perturbations in order to manipulate the behavior (either targeted or non-targeted) of any classifier trained on them, at test time and on clean samples. To achieve this, we propose an auto-encoder-like network that generates such adversarial perturbations on the training data, trained jointly with an imaginary, differentiable victim classifier. The perturbation generator learns to update its weights so as to produce the most harmful noise, i.e., noise that drives the victim classifier's test-time performance as low as possible. This can be formulated as a non-linear equality-constrained optimization problem. Unlike for GANs, solving such a problem is computationally challenging, so we propose a simple yet effective procedure that decouples the alternating updates of the two networks for stability. By teaching the perturbation generator to hijack the training trajectory of the victim classifier, the generator learns to move against the victim classifier step by step. The proposed method extends easily to the label-specific setting, where the attacker manipulates the victim classifier's predictions according to predefined rules rather than merely making them wrong. Experiments on various datasets, including CIFAR-10 and a reduced version of ImageNet, confirm the effectiveness of the proposed method, and empirical results show that such bounded perturbations transfer well across different types of victim classifiers.
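The abstract does not spell out the update rule, so the following is a minimal sketch of one plausible realization of the decoupled alternating procedure: the imaginary victim takes an ordinary training step on the perturbed data, and the generator is then updated through a single unrolled "pseudo-update" of the victim so as to maximize the victim's loss on clean data, with the perturbation kept inside an L_inf ball. All names here (Generator, victim, EPS, ALPHA) and the specific one-step unrolling are illustrative assumptions, not the paper's released code.

```python
# Sketch of decoupled alternating updates (assumes PyTorch >= 2.0 for torch.func).
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.func import functional_call

EPS = 0.3      # L_inf bound on the training-data perturbation (assumed value)
ALPHA = 0.1    # learning rate of the imaginary victim's inner SGD step (assumed)

class Generator(nn.Module):
    """Auto-encoder-like network that emits a bounded perturbation."""
    def __init__(self, dim=784, hidden=256):
        super().__init__()
        self.enc = nn.Linear(dim, hidden)
        self.dec = nn.Linear(hidden, dim)

    def forward(self, x):
        z = F.relu(self.enc(x))
        return EPS * torch.tanh(self.dec(z))   # keeps ||delta||_inf <= EPS

victim = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))
gen = Generator()
opt_victim = torch.optim.SGD(victim.parameters(), lr=ALPHA)
opt_gen = torch.optim.Adam(gen.parameters(), lr=1e-4)

def train_step(x, y):
    """One alternating round on a flattened batch x of shape (B, 784)."""
    # (1) Victim step: ordinary training on the *perturbed* data,
    #     with the generator treated as fixed (perturbation detached).
    x_adv = x + gen(x).detach()
    opt_victim.zero_grad()
    F.cross_entropy(victim(x_adv), y).backward()
    opt_victim.step()

    # (2) Generator step: unroll one pseudo-update of the victim on the
    #     perturbed data, then push that unrolled victim to fail on clean data.
    params = dict(victim.named_parameters())
    x_adv = x + gen(x)
    inner_loss = F.cross_entropy(functional_call(victim, params, (x_adv,)), y)
    grads = torch.autograd.grad(inner_loss, list(params.values()),
                                create_graph=True)
    pseudo = {k: v - ALPHA * g for (k, v), g in zip(params.items(), grads)}
    clean_loss = F.cross_entropy(functional_call(victim, pseudo, (x,)), y)
    opt_gen.zero_grad()
    (-clean_loss).backward()        # maximize the victim's clean-data loss
    opt_gen.step()
```

In this reading, the generator effectively follows the victim's training trajectory one step ahead, which is one way to interpret the abstract's "hijacking the training trajectory"; the paper's actual procedure may differ in how many victim steps are unrolled and how the two updates are scheduled.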