Image Classifier Attacks


Image classifiers have achieved very high accuracy (Krizhevsky et al., 2012). Still, no model is perfect, and image space is so large that we can expect to find pockets of incorrectly classified images. We can also anticipate regions of image space that are not represented in datasets and on which classifiers disagree. Here, we exploit this by creating fake images that are very close to real images in image space but far enough away that they are misclassified.


Neural networks work best with data that follows a standard normal distribution. Images generally do not, but a series of transforms brings them close to one (Krizhevsky et al., 2012). We exploit this feature by engineering spiky noise that “confuses” the classifier, leading to an incorrect label. To make the effect more obvious, we choose an alternative class and move the image towards a point that minimises the loss on this class while leaving the image visually very similar.
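As a rough illustration of the kind of transform mentioned above, the sketch below standardises a single RGB pixel per channel. The mean and standard deviation values are the commonly quoted ImageNet channel statistics; they are an assumption here, since the text does not state which statistics were used, and `normalise_pixel` and `denormalise_pixel` are hypothetical helper names.

```python
# Commonly quoted ImageNet channel statistics (assumed, not taken from the text).
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def normalise_pixel(rgb, mean=IMAGENET_MEAN, std=IMAGENET_STD):
    """Map an 8-bit RGB pixel to roughly zero-mean, unit-variance values."""
    return tuple((c / 255.0 - m) / s for c, m, s in zip(rgb, mean, std))


def denormalise_pixel(values, mean=IMAGENET_MEAN, std=IMAGENET_STD):
    """Invert normalise_pixel, returning floats on the 0..255 scale."""
    return tuple((v * s + m) * 255.0 for v, m, s in zip(values, mean, std))


if __name__ == "__main__":
    pixel = (124, 116, 104)  # close to the assumed channel means
    print(normalise_pixel(pixel))
```

Because the classifier sees the normalised values, a small change on the 0..255 scale is amplified by a factor of roughly 1/std in each channel before it reaches the network.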

Classifier attacks are particularly harmful against public APIs, especially when open-source models have been used, because approaches like the one outlined here can then be applied directly. Printing images and holding them up to cameras can also be effective; such behaviour may seem suspicious, but the intention of fooling a classifier may not be obvious to passers-by.

In 2018, researchers demonstrated that they could trick the object recognition system in self-driving cars into misclassifying stop signs as speed-limit signs. The stickers placed on the signs looked like random graffiti to passers-by, but had a much greater effect on the system inside the cars (Evtimov et al., 2017).


A neural network is trained by minimising a loss function with gradient descent:

$$\theta_{n+1} = \theta_n - \alpha \frac{\partial \mathcal L}{\partial \theta_n}$$

We make two simple changes to the parameters of this equation to achieve the desired result:

  1. Let \(\theta_n\) parameterise the image rather than the classifier
  2. Change the loss to be $$\mathcal {\hat L} =\frac{1}{N} \sum_{i=0}^{N-1} w_i\ BCE(f_i(\theta), T_i)$$ where \(N\) is the number of classifiers, \(f_i\) is the \(i^{th}\) classifier, \(T_i\) is the target class and \(w_i\) is the weight given to the \(i^{th}\) classifier. This is a weighted binary cross-entropy loss.
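To make the two changes concrete, the loss can be written out term by term. This is a sketch under two assumptions that the text does not spell out: \(f_i(\theta)\) is read as the probability that the \(i^{th}\) classifier assigns to the target class, and \(T_i \in \{0, 1\}\). Under that reading,

$$BCE(f_i(\theta), T_i) = -\left[ T_i \log f_i(\theta) + (1 - T_i) \log\left(1 - f_i(\theta)\right) \right]$$

With \(T_i = 1\) this reduces to \(-\log f_i(\theta)\), and, treating the weights \(w_i\) as constants, the gradient descent step on the image becomes

$$\theta_{n+1} = \theta_n - \alpha \frac{\partial \mathcal{\hat L}}{\partial \theta_n} = \theta_n + \frac{\alpha}{N} \sum_{i=0}^{N-1} \frac{w_i}{f_i(\theta_n)} \frac{\partial f_i(\theta_n)}{\partial \theta_n}$$

Each step therefore nudges the image in the direction that raises the target-class probability, with classifiers that are currently least convinced (small \(f_i\)) contributing the largest push.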

We also modify the loss for faster convergence by adding a normalisation term and setting $$w_i = \begin{cases} 1 & \text{if } \text{confidence}_i \ge 95\% \\ 0 & \text{otherwise} \end{cases}$$
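The idea of descending on the image rather than on the model can be sketched on a toy problem. The code below is not the implementation used for the experiments: it replaces the deep networks with small fixed logistic "classifiers" (hypothetical weights and biases), takes the weights \(w_i\) as a plain list rather than applying the confidence rule, and omits the normalisation term. All function names are illustrative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce(p, t, eps=1e-12):
    """Binary cross-entropy for one predicted probability p and target t."""
    p = min(max(p, eps), 1.0 - eps)
    return -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))


def predict(x, classifier):
    """Toy stand-in for f_i: a logistic model with fixed weights and bias."""
    w, b = classifier
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def ensemble_loss(x, classifiers, targets, weights):
    """Weighted mean BCE over all classifiers, evaluated at image x."""
    total = sum(wt * bce(predict(x, c), t)
                for c, t, wt in zip(classifiers, targets, weights))
    return total / len(classifiers)


def loss_gradient(x, classifiers, targets, weights):
    """Gradient of ensemble_loss with respect to the image x."""
    grad = [0.0] * len(x)
    n = len(classifiers)
    for c, t, wt in zip(classifiers, targets, weights):
        p = predict(x, c)
        # For a sigmoid output, d BCE / d logit = p - t.
        for j, wj in enumerate(c[0]):
            grad[j] += wt * (p - t) * wj / n
    return grad


def attack(x0, classifiers, targets, weights, alpha=0.1, steps=200):
    """Gradient descent on the image; the classifiers stay frozen."""
    x = list(x0)
    for _ in range(steps):
        g = loss_gradient(x, classifiers, targets, weights)
        x = [xj - alpha * gj for xj, gj in zip(x, g)]
    return x


if __name__ == "__main__":
    classifiers = [([1.0, -2.0, 0.5], 0.0), ([0.5, 1.0, -1.0], 0.1)]
    targets, weights = [1.0, 1.0], [1.0, 1.0]
    x0 = [0.2, 0.8, 0.5]
    x1 = attack(x0, classifiers, targets, weights)
    print(ensemble_loss(x0, classifiers, targets, weights),
          ensemble_loss(x1, classifiers, targets, weights))
```

With real networks the gradient would come from automatic differentiation rather than a hand-written formula, but the loop has the same shape: the model parameters are never updated, only the input.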


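For this experiment, we used the following models: AlexNet, DenseNet-121, EfficientNet-b0, Mobilenet-v3, ResNet-50, SqueezeNet-v1.1 and VGG-11.

All of these models were trained on ImageNet (Russakovsky et al., 2014) and are available with pretrained weights on the PyTorch Hub.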


This trick works very well: a fine layer of noise added to each image confuses the classifiers and produces the desired incorrect predictions with high confidence, with scores between 89.4% and 99.0% in the tables below.


Granny Smith
| Model | Predicted class | Score |
|---|---|---|
| AlexNet | pomegranate | 89.4% |
| DenseNet-121 | pineapple | 97.5% |
| EfficientNet-b0 | strawberry | 90.2% |
| Mobilenet-v3 | orange | 94.1% |
| ResNet-50 | jackfruit | 91.6% |
| SqueezeNet-v1.1 | lemon | 98.2% |
| VGG-11 | banana | 96.1% |


Soccer Ball
| Model | Predicted class | Score |
|---|---|---|
| AlexNet | croquet ball | 90.3% |
| DenseNet-121 | ping-pong ball | 98.4% |
| EfficientNet-b0 | baseball | 98.6% |
| Mobilenet-v3 | basketball | 98.9% |
| ResNet-50 | tennis ball | 97.6% |
| SqueezeNet-v1.1 | golf ball | 98.9% |
| VGG-11 | volleyball | 98.1% |


| Model | Predicted class | Score |
|---|---|---|
| AlexNet | brown bear | 89.5% |
| DenseNet-121 | cougar | 98.6% |
| EfficientNet-b0 | zebra | 96.4% |
| Mobilenet-v3 | tiger | 97.7% |
| ResNet-50 | triceratops | 94.4% |
| SqueezeNet-v1.1 | hippopotamus | 98.9% |
| VGG-11 | Komodo dragon | 99.0% |



Although the classifiers are badly fooled by the fake images, the attack is very brittle. Smoothing the images, resizing them or even saving them as a JPG or PNG (which removes most of the noise through compression or quantisation) results in classifications almost identical to those of the original images.
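A toy sketch of why such noise is fragile is given below. It assumes the perturbation is smaller than half of one 8-bit intensity level and alternates in sign from pixel to pixel; the actual noise magnitude in the experiments is not stated, so both are assumptions, and `to_uint8` and `box_blur` are hypothetical helpers operating on a 1-D row of pixels.

```python
def to_uint8(values):
    """Quantise floats in [0, 1] to 8-bit levels, as saving to 8-bit formats does."""
    return [min(255, max(0, round(v * 255.0))) for v in values]


def box_blur(values):
    """3-tap mean filter over the interior of a 1-D signal."""
    return [(values[i - 1] + values[i] + values[i + 1]) / 3.0
            for i in range(1, len(values) - 1)]


# A clean row of pixels and a spiky perturbation of alternating sign.
clean = [k / 255.0 for k in (10, 128, 200, 250, 60, 90)]
noise = [0.001 if i % 2 == 0 else -0.001 for i in range(len(clean))]
perturbed = [c + n for c, n in zip(clean, noise)]

if __name__ == "__main__":
    # Quantisation maps the perturbed row back onto the clean 8-bit levels.
    print(to_uint8(perturbed) == to_uint8(clean))
    # Smoothing shrinks the alternating noise to a third of its amplitude.
    print(max(abs(v) for v in box_blur(noise)))
```

In this toy setting, rounding to 8 bits erases the perturbation entirely, and a small blur attenuates it even before any quantisation.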


This process is time-consuming, taking ~6 hours to run on a single laptop GPU with one model loaded at a time. It can be parallelised across many images and sped up dramatically by keeping all models loaded on the same GPU, but this requires more computing resources.


Modern image classifiers are extremely accurate and robust. While the approach presented here manages to create images that are classified incorrectly, the effort required and the brittleness of the results demonstrate the strength of these models rather than exposing a weakness. As a result, there is limited scope for malicious use, as routine transforms undo most of the damage.


A. Krizhevsky, I. Sutskever, G. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. Advances in Neural Information Processing Systems, 2012.

G. Huang, Z. Liu, K. Weinberger. Densely Connected Convolutional Networks. arXiv preprint arXiv:1608.06993, 2016.

M. Tan, Q. Le. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. arXiv preprint arXiv:1905.11946, 2019.

O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al. ImageNet Large Scale Visual Recognition Challenge. arXiv preprint arXiv:1409.0575, 2014.

M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, L.-C. Chen. Inverted Residuals and Linear Bottlenecks: Mobile Networks for Classification, Detection and Segmentation. arXiv preprint arXiv:1801.04381, 2018.

K. He, X. Zhang, S. Ren, J. Sun. Deep Residual Learning for Image Recognition. arXiv preprint arXiv:1512.03385, 2015.

F. Iandola, M. Moskewicz, K. Ashraf, S. Han, W. Dally, K. Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. arXiv preprint arXiv:1602.07360, 2016.

I. Evtimov, K. Eykholt, E. Fernandes, T. Kohno, B. Li, A. Prakash, A. Rahmati, D. Song. Robust Physical-World Attacks on Machine Learning Models. arXiv preprint arXiv:1707.08945, 2017.

K. Simonyan, A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv preprint arXiv:1409.1556, 2014.


The code is available on GitHub: