An ELI5 Introduction to Adversarial Examples

Despite its impeccable success, modern deep learning systems are still prone to adversarial examples. Let's take computer vision for example. Consider an image of a pig (picture to the left). A deep learning-based image classifier can successfully identify this as a pig. Consider another instance of the same image (picture to the right), a slightly perturbed version of the first picture. It still looks like a pig to human eyes, but the image classifier identifies it as an airliner. These perturbations are called adversarial examples.

Check out the code on GitHub →

MIT piggie_0.png

-> Figure 1: To our eyes, the figures are identical, but to an image classifier, they are not the same. This is an example of an adversarial example. <-

What are Adversarial Examples?

What is the typical process of training a deep learning-based image classifier? Here are the quick steps:

  1. We start by building our favorite network architecture.
  2. We define the hyperparameters like the number of epochs, batch size, optimizer, etc.
  3. We then feed batches of preprocessed images as tensors to our network.
  4. We measure the loss of the network with the ground-truth labels and the predicted labels.
  5. We then calculate the gradients of the loss function with respect to the network’s learnable parameters.
  6. We backpropagate these gradients to update the learnable parameters of our network such that the loss is minimized.

Here is a pictorial depiction of the above process:

-> Figure 2: Pictorial overview of the training process in deep learning (source). <-

Let us take a step back and reflect on steps 5 and 6 -

  1. We then calculate the gradients of the loss function with respect to the network’s learnable parameters.
  2. We backpropagate these gradients to update the learnable parameters of our network such that the loss is minimized.

Two things to catch here is -

Now, what is even more tempting to explore is -

Suppose we can update the input data so that the training objective gets maximized instead of minimized. In that case, essentially, the updated input data can be used to fool our network. Of course, we would not want our updated input data to look visually implausible (in computer vision). For example, an image of a hog should still appear as a hog to the human eyes, as we saw in Figure 1. So, this constrains us in the following aspects -

Why care about gradients only? Because this quantity will inform us how to nudge the input data to affect the loss function.

Do not worry if the concept of adversarial examples is still foreign to you. We hope it will become more evident with the few next sections, but here is a pictorial overview of what is coming.

-> Figure 3: Pictorial overview of the FGSM method (source). <-

Types of Adversarial Examples

Adversarial examples can mainly come in two different flavors to a deep learning model. Sometimes, the data points can be naturally adversarial (unfortunately !) and sometimes, they can come in the form of attacks (also referred to as synthetic adversarial examples). We will be reviewing both the types in this section.

Natural Adversarial Examples (!)

Ever wondered what if an image is naturally adversarial? What could be more dangerous than that? Consider the following grid of images for example -

-> Figure 4: On the left-most corner of each image, we see the ground-truth label, and on the right-most corner, we see the predictions of a standard CNN-based classifier (source). <-

In their paper Natural Adversarial Examples, Hendrycks et al. present these naturally adversarial examples (along with a dataset called ImageNet-A) for classifiers pre-trained on the ImageNet-1K dataset. To assess how dangerous could these examples be here is a phrase from the abstract of the paper -

[…] DenseNet-121 obtains around 2% accuracy, an accuracy drop of approximately 90% […]

To summarize potential reasons as to why these situations might be occurring, the authors emphasize the following points -

Geirhos et al. studied this phenomenon in their paper ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. They first stylize the ImageNet-1K training images to impose bias toward shape in the CNN-based classifiers. Unfortunately, this idea does not transfer well in the case of the ImageNet-A dataset.

What is worse is the traditional adversarial training methods do not help to prevent the standard CNN-based classifiers from these examples. To help the networks tackle these severe problems, Hendrycks et al. suggest using self-attention mechanisms since it helps a network better capture long-range dependencies and other forms of interactions from its inputs.

Natural adversarial examples are a pain. However, it is also possible to take a natural image and turn that into an adversarial example synthetically.

Taming the Input Data Points as Attacks (Synthetic)

In this part, we will be exploring the following questions we raised with code in tf.keras. For now, we will be basing our exploration with respect to image classification models. However, note that adversarial examples are a general threat and can be extended to any learning task.

What if we calculated the gradients with respect to the input data? What if the input data were updated (with the gradients) to maximize the loss?

So, here is how you would run an inference with a pre-trained ResNet50 model in tf.keras -

Full code here →

# Load ResNet50
resnet50 = tf.keras.applications.ResNet50(weights="imagenet")

# Run inference
preds = resnet50.predict(preprocessed_image)

# Process the predictions

If we pass the following image to resnet50 and process the predictions we would get - [('n02395406', 'hog', 0.99968374), ('n02396427', 'wild_boar', 0.00031595054), ('n03935335', 'piggy_bank', 9.8273716e-08)] (top-3 predictions).

-> Figure 5: A sample image for running inference with an image classifier. <-

Now, these labels we are seeing come from the ImageNet dataset, and hog corresponds to the class index of 341. So, if we calculate the cross-entropy loss between 341 (ground truth label) and resnet50's predictions, we can expect the loss to be low -

cce = tf.keras.losses.SparseCategoricalCrossentropy()
loss = cce(
# 0.0004157156

Turning this loss into probability gives us the number (not the exact one but pretty close) we saw above - np.exp(-loss.numpy()) - 0.9995844.

Now, in order to perturb the image, we will be adding a quantity $\delta$ to it such that when it gets added to the input image, the loss increases. Keep in mind that while perturbing the image, we will not be disturbing its visual semantics i.e., to our eyes, the image should be appearing as a hog. Enough talk, let us see this in action -

Full code here →

scc_loss = tf.keras.losses.SparseCategoricalCrossentropy()
for t in range(50):
    with tf.GradientTape() as tape:
        inp = preprocess_input(image_tensor + delta)
        predictions = resnet50(inp, training=False)
        loss = - scc_loss(
        if t % 5 == 0:
            print(t, loss.numpy())

    # Get the gradients
    gradients = tape.gradient(loss, delta)
    # Update the weights
    optimizer.apply_gradients([(gradients, delta)])

    # Clip so that the delta values are within [0,1]

Section 6

Here’s how clip_eps is implemented -

# Clipping utility so that the pixel values stay within [0,1]
EPS = 2./255

def clip_eps(delta_tensor):
    return tf.clip_by_value(delta_tensor, clip_value_min=-EPS, clip_value_max=EPS)

I used Adam as the optimizer with a learning rate of 0.1. Both the EPS and learning rate hyperparameters play a vital role here. Even the choice of optimizer can significantly affect the loss landscape.

This form of gradient descent-based optimization is also referred to as Projected Gradient Descent.

If we finally add delta to our original image, it does change that much -

-> Figure 6: A synthetically generated adversarial example. <-

And when we run inference -

# Generate prediction
perturbed_image = preprocess_input(image_tensor + delta)
preds = resnet50.predict(perturbed_image)
print('Predicted:', tf.keras.applications.resnet50.decode_predictions(preds, top=3)[0])

We get - Predicted: [('n01883070', 'wombat', 0.9999833), ('n02909870', 'bucket', 9.77628e-06), ('n07930864', 'cup', 1.6957516e-06)]. We don’t see any sign of hog in the top-3 predictions 😂

Okay, this is fun, but what if we wanted the model to predict the class of our choice? For example, I want the model to predict the hog as a Lakeland_terrier. Note that this target class should belong to the ImageNet dataset in this case. Generally, we would want to choose this target class on which a given model has been trained on. This attack is also known as targeted attack.

The only part that gets changed in the previous code listing we saw is this one -

loss = - scc_loss(

Instead, we do -

loss = (- scc_loss(tf.convert_to_tensor([true_index]), predictions) + 
    scc_loss(tf.convert_to_tensor([target_index]), predictions))

Full code here →

This expression allows us to update the delta in such a way that the probability of the model to predict the true class gets minimized and to predict the target class gets maximized. After a round of optimization, we get the same image (apparently) but get the predictions as - Predicted: [('n02095570', 'Lakeland_terrier', 0.07326998), ('n01883070', 'wombat', 0.037985403), ('n02098286', 'West_Highland_white_terrier', 0.034288753)].

-> Figure 7: A synthetically generated adversarial example (targeted this time). <-

We just saw how to fool one of the most influential deep learning architectures in merely ten code lines. As responsible deep learning engineers, this should not be the only thing we would crave for. However, instead, we should be exploring ways to train deep learning models with some adversarial regularization. We will be getting to that in a later report.

At this point, you are probably wondering how does delta look like? Absolutely random.

-> Figure 8: The learned delta vector. <-

Moreover, to get delta to an observable condition you would need to zoom it quite a bit - plt.imshow(50*delta.numpy().squeeze()+0.5).

Until now, the techniques we studied are mostly based around images but note that these apply to other modalities of data. If you are interested to know more about how it may look like in text, check out TextAttack.

A Note on Optimizer Susceptibility to the Synthetic Attacks

This section presents the susceptibility of different optimizers to the synthetic attacks. We have only seen the Adam optimizer in action till now (with a learning rate of 0.1 and other default hyperparameters of tf.keras). Let’s see what the other optimizers respond to the attacks.

Section 9

Grad-CAM on Perturbed Images

This section shows the class activation maps retrieved from our network's predictions on different perturbed images. I used the code from this blog post by PyImageSearch to generate these activation maps.

Section 11


I hope this report provided you with an intuitive introduction to adversarial examples in deep learning. The techniques we discussed are inspired by the Fast Gradient Sign Method, as demonstrated by Goodfellow et al. in their paper - Explaining and Harnessing Adversarial Examples. Additionally, this paper discusses a wide variety of different adversarial attacks and, most notably, how to defend against them.

You might be wondering if adversarial examples are so negatively impactful why can’t we just train our networks to be adversarially robust. Well, of course, you can! This report could have been an overkill if I included it. This report is a part of a mini-crash course on adversarial examples and adversarial training in deep learning, so definitely stay tuned if you are interested to know more about this field, including training neural networks to be adversarially aware.

Huge thanks to Ishaan Malhi and Jack Morris for providing inputs on the early drafts of the report.