Image Classifier Attacks
Introduction
Image classifiers have achieved remarkably high accuracy (Krizhevsky et al., 2012). Still, no model is perfect, and image space is so vast that we can expect pockets of images that are classified incorrectly. We can also anticipate regions of image space that are not represented in the training datasets and on which different classifiers disagree. Here, we exploit this by creating fake images that sit very close to real images in image space, yet far enough away that they are misclassified.
Background
Neural networks work best with data that follows a standard normal distribution. Images generally do not obey this, so a series of transforms is applied to bring them closer (Krizhevsky et al., 2012). We exploit this preprocessing by engineering spiky noise that "confuses" the classifier into producing an incorrect label. To make the attack more pointed, we choose a target class and move the image towards a point that minimises the loss for that class while leaving the image visually almost unchanged.
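For concreteness, the standard ImageNet preprocessing chain in torchvision looks roughly like the following (the mean and standard deviation are the usual ImageNet channel statistics):

```python
import torchvision.transforms as T

# Standard ImageNet preprocessing: resize, crop, convert to a [0, 1] tensor,
# then shift each channel towards a roughly standard normal distribution.
preprocess = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225]),
])
```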
Classifier attacks are particularly harmful against public APIs, especially when they are built on open-source models and approaches like the one outlined here can be applied directly. Printing adversarial images and holding them up to cameras can also be effective; such behaviour may look odd, but the intention of fooling a classifier is rarely obvious to passers-by.
In 2018, researchers demonstrated that they could trick the object recognition system in self-driving cars into misclassifying stop signs as speed-limit signs. The stickers placed on the signs looked like random graffiti to passers-by, but had a much greater effect on the system inside the cars (Evtimov et al., 2017).
Approach
A neural network is trained to minimise a loss function by gradient descent:
$$\theta_{n+1} = \theta_n - \alpha \frac{\partial \mathcal L}{\partial \theta_n}$$
We make two simple changes to the parameters of this equation to achieve the desired result:
- Let \(\theta_n\) parameterise the image - not the classifier
- Change the loss to be $$\mathcal {\hat L} =\frac{1}{N} \sum_{i=0}^{N-1} w_i\ BCE(f_i(\theta), T_i)$$ where \(f_i\) is the \(i^{th}\) classifier, \(T_i\) is the target class and \(w_i\) is the weight given to the \(i^{th}\) classifier. This is a weighted binary cross-entropy loss.
We also modify the loss for faster convergence by adding a normalisation term to the loss and setting $$w_i = \begin{cases} 1 & \text{if } \text{confidence}_i \ge 95\% \\ 0 & \text{otherwise} \end{cases}$$
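A minimal sketch of a single update step under these changes is shown below, assuming PyTorch models that output per-class logits; the names here (`attack_step`, `models`, `weights`, `target`) are illustrative rather than the repository's actual API:

```python
import torch
import torch.nn.functional as F

def attack_step(image, models, weights, target, lr=0.01):
    """One gradient-descent step on the image itself, not on the classifiers.

    `image` has shape (1, 3, H, W); `target` is a one-hot float tensor of
    shape (1, num_classes) encoding the target class T.
    """
    image = image.clone().detach().requires_grad_(True)

    # Weighted binary cross-entropy across all classifiers (the loss defined above).
    loss = image.new_zeros(())
    for model, w in zip(models, weights):
        logits = model(image)                        # shape (1, num_classes)
        loss = loss + w * F.binary_cross_entropy_with_logits(logits, target)
    loss = loss / len(models)

    loss.backward()
    with torch.no_grad():
        # theta_{n+1} = theta_n - alpha * dL/dtheta, applied to the image pixels.
        image = image - lr * image.grad
    return image.detach()
```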
Models
For this experiment, the models we used were:
- AlexNet (Krizhevsky et al., 2012)
- DenseNet-121 (Huang et al., 2016)
- EfficientNet-b0 (Tan & Le, 2019)
- MobileNet-v2 (Sandler et al., 2018)
- ResNet-50 (He et al., 2015)
- SqueezeNet-v1.1 (Iandola et al., 2016)
- VGG-11 (Simonyan & Zisserman, 2014)
All of these models were trained on ImageNet (Russakovsky et al., 2014) and are available with pretrained weights on the PyTorch Hub.
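For reference, the pretrained weights can be pulled from the PyTorch Hub along these lines (the identifiers are the ones published in the `pytorch/vision` repository; `pretrained=True` is the older torchvision interface, and newer releases use a `weights=` argument instead):

```python
import torch

# Hub identifiers for the models listed above (pytorch/vision repository).
MODEL_NAMES = [
    "alexnet", "densenet121", "efficientnet_b0", "mobilenet_v2",
    "resnet50", "squeezenet1_1", "vgg11",
]

models = []
for name in MODEL_NAMES:
    model = torch.hub.load("pytorch/vision", name, pretrained=True)
    model.eval()  # the attack only needs forward passes and input gradients
    models.append(model)
```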
Results
This trick works remarkably well: a fine layer of noise added to each image confuses the classifiers and yields the desired incorrect predictions with very high confidence.
Apple

Model | Predicted class | Confidence |
---|---|---|
AlexNet | pomegranate | 89.4% |
DenseNet-121 | pineapple | 97.5% |
EfficientNet-b0 | strawberry | 90.2% |
MobileNet-v2 | orange | 94.1% |
ResNet-50 | jackfruit | 91.6% |
SqueezeNet-v1.1 | lemon | 98.2% |
VGG-11 | banana | 96.1% |
Ball

Model | Predicted class | Confidence |
---|---|---|
AlexNet | croquet ball | 90.3% |
DenseNet-121 | ping-pong ball | 98.4% |
EfficientNet-b0 | baseball | 98.6% |
MobileNet-v2 | basketball | 98.9% |
ResNet-50 | tennis ball | 97.6% |
SqueezeNet-v1.1 | golf ball | 98.9% |
VGG-11 | volleyball | 98.1% |
Pig

Model | Predicted class | Confidence |
---|---|---|
AlexNet | brown bear | 89.5% |
DenseNet-121 | cougar | 98.6% |
EfficientNet-b0 | zebra | 96.4% |
MobileNet-v2 | tiger | 97.7% |
ResNet-50 | triceratops | 94.4% |
SqueezeNet-v1.1 | hippopotamus | 98.9% |
VGG-11 | Komodo dragon | 99.0% |
Limitations
Brittleness
Even though the classifiers' predictions on the fake images are reliably wrong, the images themselves are very brittle. Smoothing them, resizing them, or even saving them as a JPEG or PNG (where lossy compression or 8-bit quantisation strips out most of the noise) results in almost the same classification as the original images.
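A quick way to observe this brittleness is to round-trip an adversarial image through JPEG and compare predictions. In this sketch, `adv_image` (a [0, 1] tensor) and `model` are placeholders, and the ImageNet normalisation step is omitted for brevity:

```python
import io

import torch
from PIL import Image
from torchvision.transforms.functional import to_pil_image, to_tensor

def jpeg_roundtrip(image, quality=90):
    """Save a [0, 1] image tensor as a JPEG in memory and load it back."""
    buffer = io.BytesIO()
    to_pil_image(image).save(buffer, format="JPEG", quality=quality)
    buffer.seek(0)
    return to_tensor(Image.open(buffer))

# `adv_image` and `model` are assumed to exist already (see the attack above).
with torch.no_grad():
    before = model(adv_image.unsqueeze(0)).argmax(dim=1).item()
    after = model(jpeg_roundtrip(adv_image).unsqueeze(0)).argmax(dim=1).item()

# The adversarial label typically reverts to the original class after the
# round trip, because compression destroys the carefully engineered noise.
print(before, after)
```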
Resources
This process is time-consuming, taking roughly six hours on a single laptop GPU with one model loaded at a time. It can be parallelised across many images and sped up dramatically by keeping all of the models on the same GPU, but this requires more computing resources.
Conclusions
Modern image classifiers are extremely accurate and robust. While the approach presented here does produce images that are classified incorrectly, the effort required and the fragility of the resulting images demonstrate the strength of these models rather than exposing a genuine weakness. As a result, there is limited scope for malicious use of this technique, since routine transforms undo most of the damage.
Bibliography
A. Krizhevsky, I. Sutskever, G. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. Advances in Neural Information Processing Systems, 2012.
G. Huang, Z. Liu, K. Weinberger. Densely Connected Convolutional Networks. arXiv preprint arXiv:1608.06993, 2016.
M. Tan, Q. Le. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. arXiv preprint arXiv:1905.11946, 2019.
O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al. ImageNet Large Scale Visual Recognition Challenge. arXiv preprint arXiv:1409.0575, 2014.
M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, L.-C. Chen. Inverted Residuals and Linear Bottlenecks: Mobile Networks for Classification, Detection and Segmentation. arXiv preprint arXiv:1801.04381, 2018.
K. He, X. Zhang, S. Ren, J. Sun. Deep Residual Learning for Image Recognition. arXiv preprint arXiv:1512.03385, 2015.
F. Iandola, M. Moskewicz, K. Ashraf, S. Han, W. Dally, K. Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. arXiv preprint arXiv:1602.07360, 2016.
I. Evtimov, K. Eykholt, E. Fernandes, T. Kohno, B. Li, A. Prakash, A. Rahmati, D. Song. Robust Physical-World Attacks on Machine Learning Models. arXiv preprint arXiv:1707.08945, 2017.
K. Simonyan, A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv preprint arXiv:1409.1556, 2014.
Code
The code is available on GitHub: https://github.com/George-Ogden/classifier-attack