Hiding In Plain Sight: Deep Steganography

Can we design a neural network to hide a full-sized color image inside another image, and then recover it? How might changing learning rates and other hyperparameters affect our model's performance?
Krisha Mehta


Steganography is the technique of concealing secret data within an ordinary, non-secret file or message to avoid detection. The secret data is then extracted at its destination. Steganography can be combined with encryption as an extra layer of protection for the hidden data.

In this report, a full-sized color image is hidden inside another image (called the cover image) with minimal changes to the cover's appearance, using deep convolutional neural networks. We then combine the hiding network with a "reveal" network to extract the secret image from the generated image.

In the above example, the first image is the secret image to be hidden. The second is the cover image, which will conceal it. The third is the result of the hiding process: an encoded cover image that looks nearly identical to the original cover.

Reproduce the results in colab →

Our Model

I have used this research paper as the inspiration and reference for this article: Hiding Images in Plain Sight: Deep Steganography.

Our goal is to automate the task of hiding as well as revealing the secret image (or message). The architecture of the network is similar to that of an auto-encoder.

The network consists of three sections:

  • The preparation network
  • The hiding network
  • The reveal network

We will combine these three parts into an end-to-end system for hiding as well as revealing the hidden image.
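To make the three-part structure concrete, here is a minimal Keras sketch of how the sections could be wired together. The layer widths, depths, and kernel sizes below are illustrative placeholders, not the exact architecture from the paper or this report's notebook.

```python
# A minimal sketch of the three-part architecture in Keras.
# Layer widths and kernel sizes are illustrative, not the paper's exact values.
from tensorflow.keras.layers import Input, Conv2D, concatenate
from tensorflow.keras.models import Model

def conv_branch(x, kernel_size):
    # A small stack of same-padded convolutions, standing in for the
    # deeper stacks used in the actual networks.
    for _ in range(2):
        x = Conv2D(32, kernel_size, padding='same', activation='relu')(x)
    return x

secret = Input(shape=(64, 64, 3))
cover = Input(shape=(64, 64, 3))

# Preparation network: transforms the secret image into useful features.
prep = concatenate([conv_branch(secret, 3), conv_branch(secret, 5)])

# Hiding network: takes the cover image plus the prepared secret features
# and produces the container (encoded cover) image.
hidden_in = concatenate([cover, prep])
container = Conv2D(3, 3, padding='same', activation='sigmoid',
                   name='encoded_cover')(conv_branch(hidden_in, 3))

# Reveal network: recovers the secret image from the container alone.
revealed = Conv2D(3, 3, padding='same', activation='sigmoid',
                  name='decoded_secret')(conv_branch(container, 3))

model = Model(inputs=[secret, cover], outputs=[container, revealed])
model.summary()
```

Because the three networks are chained into one `Model`, gradients from both outputs flow back through the whole system, which is what lets us train it end to end.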


Here is how the research paper describes the functionalities of the networks: the preparation network transforms the secret image into features that can be encoded efficiently; the hiding network takes the cover image together with these features and produces the container image; and the reveal network, used by the receiver, extracts the secret image from the container.

Complete example in colab →

Loss Function

The loss function consists of two error terms: the reconstruction loss of the cover image and the reconstruction loss of the secret image.

L(c, c′, s, s′) = ||c − c′|| + β||s − s′||

Here, c and c′ are the original and reconstructed cover images, and s and s′ are the original and reconstructed secret images. β is a hyperparameter that controls how heavily the secret image's reconstruction error is weighted.
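The loss can be sketched in a few lines of NumPy. This is an assumption-laden illustration: it interprets ||·|| as a sum-of-squares error over all pixels, which is one common reading; the actual notebook may use a per-pixel mean or another norm.

```python
# A sketch of the joint loss, assuming images are numpy arrays in the
# same value range. `beta` weights the secret-image reconstruction term.
import numpy as np

def stego_loss(c, c_prime, s, s_prime, beta=1.0):
    # L(c, c', s, s') = ||c - c'|| + beta * ||s - s'||
    # Here ||.|| is taken as a sum of squared errors (an assumption).
    cover_loss = np.sum(np.square(c - c_prime))
    secret_loss = np.sum(np.square(s - s_prime))
    return cover_loss + beta * secret_loss
```

Note that the cover term only depends on the hiding network's output, while the secret term backpropagates through both the reveal and hiding networks, so β trades off invisibility of the secret against fidelity of its recovery.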

Tiny ImageNet Dataset


We'll use the Tiny ImageNet dataset to train our model. The dataset consists of 2000 images of dimension 64x64x3. We'll then form pairs of cover and secret images from the dataset, giving us 1000 pairs of cover and secret images.

# The first half is used for training as secret images, the second half as cover images.

# S: secret image
input_S = X_train[0:X_train.shape[0] // 2] 
# C: cover image
input_C = X_train[X_train.shape[0] // 2:]
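Wrapped as a small helper, the split above yields the 1000 pairs mentioned earlier. The `make_pairs` helper and the random stand-in data below are hypothetical; in the notebook, `X_train` holds the real Tiny ImageNet images scaled to [0, 1].

```python
# A hypothetical sketch of the pairing step. The random array stands in
# for 2000 Tiny ImageNet images already scaled to [0, 1].
import numpy as np

def make_pairs(X_train):
    # First half -> secret images, second half -> cover images.
    half = X_train.shape[0] // 2
    return X_train[:half], X_train[half:]

X_train = np.random.rand(2000, 64, 64, 3).astype('float32')
input_S, input_C = make_pairs(X_train)
```

With 2000 images split down the middle, each of the 1000 secret images gets exactly one cover image.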


We will now optimize our model with different combinations of hyperparameters. The task of choosing the best run for your needs can be easily automated using Weights & Biases sweeps, which automatically create beautiful visualizations.

I have experimented with learning rates, activation functions, and the number of epochs. The following parallel coordinates chart shows the effect of these parameters on the mean reconstruction loss.
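A sweep over these hyperparameters can be declared as a small configuration dictionary. The value lists below mirror the settings discussed in this report but are assumptions about the exact grid; `train` is a hypothetical function that reads `wandb.config` and logs the mean reconstruction loss.

```python
# A hypothetical W&B sweep configuration for the hyperparameters
# explored in this report.
sweep_config = {
    'method': 'grid',
    'metric': {'name': 'mean_reconstruction_loss', 'goal': 'minimize'},
    'parameters': {
        'learning_rate': {'values': [0.001, 0.0005]},
        'activation':    {'values': ['relu', 'tanh']},
        'epochs':        {'values': [300, 600]},
    },
}

# To launch (requires a logged-in wandb session and a `train` function):
# import wandb
# sweep_id = wandb.sweep(sweep_config, project='deep-steganography')
# wandb.agent(sweep_id, function=train)
```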

A major takeaway from this chart is that, with all other parameters held the same, the ReLU activation function optimizes significantly faster than tanh. This becomes more evident when we take a closer look at the logged outputs.



Changing the activation function from ReLU to tanh makes optimization quite unstable at higher learning rates (represented by the two flat lines in the plot). After halving the learning rate, the tanh model starts to optimize better, but the model with ReLU activation still outperforms it.


The above visualizations help us pick out the best performing model as well as the most relevant parameter. The model optimized to lowest reconstruction loss was the one with ReLU activation, lower learning rate, and 600 epochs. The parameter importance plot tells us that the learning rate is the most critical parameter affecting the optimization, followed by the activation function and the number of epochs.


Until now, we have looked at numbers and metrics to study how the hyperparameters affect a particular model's performance. Let us now look at the results and compare them side by side to better understand how these models automate the task of steganography.

First, let us compare the top two models' outputs, i.e., the models with the lowest final mean image reconstruction loss. The format of the output is as follows:

  1. Each of the six rows shows one example.
  2. The six columns represent:
    • cover image (input)
    • secret image (input)
    • encoded cover (output of the hiding network)
    • decoded secret (output of the reveal network)
    • the difference between the encoded cover and the original cover (Diff Cover)
    • the difference between the decoded secret and the original secret image (Diff Secret)


The Effect of the Number of Epochs

On the left is the output of the model trained for 600 epochs; on the right, the output of the model trained for 300 epochs.

A few features clearly differentiate the outputs. The first model hides the secret image much better, leaving almost no trace of it in the encoded cover image. Moreover, its difference between the encoded cover and the original cover is mostly a black image, which indicates that the encoded cover stays very close to the original. This confirms our deduction from the metric visualizations that the first model performs best.


The Effect of Activation Functions and Learning Rate

The most obvious difference in the outputs is that the encoded cover of the model with tanh activation fails to maintain the original colors. It also sometimes transfers some features of the secret image into the encoded cover. Neither problem occurs with the model using ReLU activation.

Another interesting takeaway from the metric visualization is that the model with tanh activation, when optimized with the higher learning rate, got stuck in a local minimum, as indicated by its flat mean reconstruction loss curve. Models with ReLU activation do not seem to face this problem. Let us see if we can confirm this deduction by visualizing the outputs of the tanh models that got stuck in local minima.


It is quite apparent from the above results that our deduction from the metric visualization was on point: both the cover and the secret image are lost in the encoded image when the tanh activation function is combined with the higher learning rate. It is safe to say that these networks did not optimize. This is also confirmed by the large amount of visible structure present in the Diff Cover and Diff Secret columns.

The models with a ReLU activation function do not suffer the same failure, as seen in the second row of visualizations. The model with ReLU activation performs well at both the higher (0.001) and the lower (0.0005) learning rate.


Deep steganography improves on classical methods for digital steganography, which are easy to decode and can hide only a minimal amount of information. Unlike many popular steganographic methods that encode the secret message within the least significant bits of the carrier image, this approach compresses and distributes the secret image's representation across all available bits.
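For contrast, here is a minimal version of the classical least-significant-bits scheme mentioned above: the cover keeps its 4 high bits and the secret's 4 high bits are stored in the low bits. It is easy to implement, but anyone who inspects the low bits can recover the secret, which is exactly the weakness the learned approach sidesteps.

```python
# A minimal classical LSB scheme: hide one image in the low 4 bits
# of another. Shown for contrast with the learned approach.
import numpy as np

def lsb_hide(cover, secret):
    # Keep the cover's 4 high bits; store the secret's 4 high bits
    # in the cover's 4 low bits.
    return (cover & 0xF0) | (secret >> 4)

def lsb_reveal(container):
    # Shift the low 4 bits back up; the 4 discarded bits stay zero.
    return (container & 0x0F) << 4

cover = np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8)
secret = np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8)
container = lsb_hide(cover, secret)
recovered = lsb_reveal(container)
```

Note how the hidden information lives entirely in a fixed, predictable place, whereas the neural approach spreads it across all bits of the container.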

If you're curious, I've created a full example for you to try in colab →.