
Keras Dense Layer: How to Use It Correctly

In this article, we'll look at the Dense Layer in Keras so that you can build a thorough understanding that will be vital when building custom models in Keras.
Created on April 28|Last edited on May 22
The Dense layer in Keras is a good old fully connected (densely connected) neural network layer. There's nothing more to it! However, understanding it thoroughly will go a long way while building custom models in Keras.
In this article, we'll explore what the dense layer is and how it works in practice so that you have everything you need.



What Are Dense Layers?

In machine learning, a fully connected layer connects every input feature to every neuron in that layer. A dense layer is most commonly used in three ways: as the penultimate layer after a feature extraction block (convolution, encoder, decoder, etc.), as the output (final) layer, and to project a vector of dimension d0 to a new dimension d1.
Let's consider a 1D input feature:
This input is processed using a fully connected layer with 4 neurons (we will ignore the non-linearity and bias for simplicity). How many connections do we get? 3 * 4 = 12, i.e., each of the 3 values in the 1D input is connected to each of the 4 neurons by its own weight (represented by the colored lines), as shown in the figure below.
Fig1: A fully connected neural network.

Show me code

We'll use Keras to implement the fully connected neural network shown in Figure 1. The Dense layer can be used to achieve this, as shown below:
import tensorflow as tf
from tensorflow.keras import layers

# Let's create a random input feature shape with batch_size of 1.
inputs = tf.random.uniform(shape=(1,3)) # (batch_size, num_features)

# Initialize a fully connected layer (we will not be using bias for simplicity).
dense_layer = layers.Dense(units=4, use_bias=False)

# What should we expect as output?
print(dense_layer(inputs))
>>> tf.Tensor([[-0.07313907 0.31886736 -0.83136106 0.47411117]], shape=(1, 4), dtype=float32)
For an input of shape (1,3), you get an output of shape (1,4) since four neurons are in the fully connected layer. In Fig 1, 12 lines are drawn representing the weights. Let's inspect the initialized weights for our fully connected layer:
print(dense_layer.weights)
>>> [<tf.Variable 'dense_1/kernel:0' shape=(3, 4) dtype=float32, numpy=
array([[-0.18026423, 0.8457761 , 0.20618927, 0.34542954],
[-0.68638337, -0.09881872, -0.891773 , 0.7983763 ],
[ 0.84850013, -0.5968083 , -0.522443 , -0.62690055]],
dtype=float32)>]
As expected, it is a matrix of shape (3, 4), i.e., 3 * 4 = 12 weights.
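To make the link between the weight matrix and the output concrete, here is a minimal sketch that checks a bias-free Dense layer without activation against a plain matrix multiplication. The variable names (layer_output, manual_output) are illustrative and not from the original notebook.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

# Random input with a batch size of 1 and 3 features.
inputs = tf.random.uniform(shape=(1, 3))

# Dense layer with 4 neurons, no bias, and no activation.
dense_layer = layers.Dense(units=4, use_bias=False)
layer_output = dense_layer(inputs).numpy()

# The only weight of the layer is the kernel of shape (3, 4).
kernel = dense_layer.get_weights()[0]

# Without bias and activation, the layer is just inputs @ kernel.
manual_output = inputs.numpy() @ kernel

print(kernel.shape, kernel.size)
print(np.allclose(layer_output, manual_output, atol=1e-5))
```

If the two results match, the 12 lines in Fig 1 are exactly the 12 entries of the kernel.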

How About An N-Dimensional Input?

Now let's suppose we have an input of shape (time/num_frame/arbitrary_feature, features). This input can be a sequence (time series or video) or an arbitrary feature space. Depending on the use case, you can pass (or process) this input through a fully connected layer in multiple ways. Let's consider a few before we go down the rabbit hole.

1. It's an Arbitrary Feature Space

The input can be some arbitrary feature space of shape (arbitrary_feature, features). You can consider and experiment with flattening the input. What does input flattening look like in Keras? Let's find out!

1.1 Input Flattening

In Keras, one can use the Flatten() layer to flatten any input into a 1D vector. This layer doesn't flatten along the batch dimension, i.e., if the input has a shape of (32, 2, 3), where 32 is the batch size, the flattening operation will give an output of shape (32, 6).
# Let's initialize a constant input of shape (1, 2, 3) where 1 is the batch size.
inputs = tf.constant(
[[[1, 1, 1], [2, 2, 2]]]
)
# Initialize a flattening layer
flatten = layers.Flatten()
# What should we expect as the output?
outputs = flatten(inputs)
print(outputs)
>>> tf.Tensor([[1 1 1 2 2 2]], shape=(1, 6), dtype=int32)
We get a flattened vector of shape (1, 6), where 1 is the batch size. This layer is usually used after a feature extraction block in a deep neural network. Flattening is also valid when the entries along the arbitrary_feature dimension are independent of each other, i.e., when the features are not part of a sequence. Let's pass this through a dense layer:
dense_layer = layers.Dense(
units=4, # We have 4 neurons in this fully connected layer.
use_bias=False,
kernel_initializer=tf.keras.initializers.Constant(value=0.5) # The weights are initialized with a constant value of 0.5.
)

# What should we expect as the output?
print(dense_layer(outputs))
>>> tf.Tensor([[4.5 4.5 4.5 4.5]], shape=(1, 4), dtype=float32)
How did we get this value? After the fully connected layer, each output value (4.5) is computed like 1*0.5 + 1*0.5 + 1*0.5 + 2*0.5 + 2*0.5 + 2*0.5 = 4.5.
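The same arithmetic can be reproduced outside Keras. The sketch below uses NumPy only and assumes the same constant kernel value of 0.5; note that after flattening, the kernel has shape (6, 4), so the layer holds 24 weights.

```python
import numpy as np

# Flattened input of shape (1, 6), as produced by Flatten() above.
flat_input = np.array([[1, 1, 1, 2, 2, 2]], dtype=np.float32)

# Constant kernel of shape (6, 4): 6 input features, 4 neurons.
kernel = np.full((6, 4), 0.5, dtype=np.float32)

# Each output value is the sum of all six inputs times 0.5.
output = flat_input @ kernel
print(output)  # [[4.5 4.5 4.5 4.5]]
print(kernel.size)  # 24
```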

2. It's a Sequence

Now let's consider an input of shape (time, features). Here each feature is dependent along the time axis, and a flattened vector would lose this dependence. Consider a scenario where we want to project the features to a new dimension. Can we use the Keras Dense() layer to do so?
As per the documentation:
If the input to the layer has a rank greater than 2, then Dense computes the dot product between the inputs and the kernel along the last axis of the inputs and axis 0 of the kernel. For example, if input has dimensions (batch_size, d0, d1), then we create a kernel with shape (d1, units), and the kernel operates along axis 2 of the input, on every sub-tensor of shape (1, 1, d1) (there are batch_size * d0 such sub-tensors). The output in this case will have shape (batch_size, d0, units).
Let's break it down and understand each moving part with code.
# The input is of shape (1, 2, 3) where 1 is the batch size and 2 is the time axis.
inputs = tf.constant(
[[[1, 1, 1], [2, 2, 2]]]
)

# The rank of input is greater than 2?
print(tf.rank(inputs))
>>> <tf.Tensor: shape=(), dtype=int32, numpy=3>
Let's use the Dense layer to project the last dimension from 3 to 4.
# Initialize a dense layer with 4 output neurons and a constant weight of 0.5.
dense_layer = layers.Dense(
units=4,
use_bias=False,
kernel_initializer=tf.keras.initializers.Constant(value=0.5)
)

# What should be the expected output?
print(dense_layer(inputs))
>>> tf.Tensor(
[[[1.5 1.5 1.5 1.5]
[3. 3. 3. 3. ]]], shape=(1, 2, 4), dtype=float32)
As you can see, the Dense layer projected the input of shape (1, 2, 3) to (1, 2, 4). We got the output value of 1.5 because of this computation: 1*0.5 + 1*0.5 + 1*0.5 = 1.5. Similarly, we got the output value of 3.0 because of this computation: 2*0.5 + 2*0.5 + 2*0.5 = 3. As per the documentation, the weight matrix's shape should be (3, 4).
print(dense_layer.weights)
>>> [<tf.Variable 'dense_5/kernel:0' shape=(3, 4) dtype=float32, numpy=
array([[0.5, 0.5, 0.5, 0.5],
[0.5, 0.5, 0.5, 0.5],
[0.5, 0.5, 0.5, 0.5]], dtype=float32)>]
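The documented behavior for inputs of rank greater than 2 (a dot product along the last axis of the input and axis 0 of the kernel) can be written as an einsum. This is a NumPy sketch of that contraction under the same constant kernel, not the actual Keras implementation; the subscript letters b, t, f, u (batch, time, features, units) are chosen here for illustration.

```python
import numpy as np

# Input of shape (1, 2, 3): batch size 1, 2 time steps, 3 features.
inputs = np.array([[[1, 1, 1], [2, 2, 2]]], dtype=np.float32)

# Kernel of shape (3, 4), shared by every time step.
kernel = np.full((3, 4), 0.5, dtype=np.float32)

# Contract the feature axis of the input with axis 0 of the kernel.
output = np.einsum("btf,fu->btu", inputs, kernel)

# np.tensordot expresses the same contraction.
output_tensordot = np.tensordot(inputs, kernel, axes=[[2], [0]])

print(output.shape)  # (1, 2, 4)
print(output)
```

The kernel stays at 12 weights regardless of how many time steps the input has, which is the key difference from the flattening approach.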

2.1 TimeDistributed Layer

So far, we have looked at this from the perspective of projecting our (time, features) input to a different dimension. But what if we want to apply a Dense operation (dot products) to each time step separately?
To add more nuance to this discussion, imagine a video data sample of shape (num_frames, height, width, 3). You would like to extract information from each frame sequentially using a pre-trained image model. The TimeDistributed layer allows you to apply a layer (feature extractor here) to every temporal slice (frames here) of an input (video here).
Let's see how we can apply a Dense operation to an input of shape (time, features). We will use the inputs initialized in the previous example:
# This dense layer will be applied to each `time`.
dense_layer = layers.Dense(
units=4,
use_bias=False,
kernel_initializer=tf.keras.initializers.Constant(value=0.5)
)
# Initialize the `TimeDistributed` layer.
timedistributed = layers.TimeDistributed(dense_layer)
# What should be the expected output?
print(timedistributed(inputs))
>>> tf.Tensor(
[[[1.5 1.5 1.5 1.5]
[3. 3. 3. 3. ]]], shape=(1, 2, 4), dtype=float32)
Isn't the output the same as the one in the previous section? Because Dense already acts on the last axis independently for every time step, TimeDistributed(Dense(...)) and Dense(...) are equivalent to each other in this scenario.
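The equivalence can also be checked with random inputs and randomly initialized weights rather than constants. In this sketch the same Dense instance is called directly and through TimeDistributed, so both paths share one kernel; the names direct and wrapped are illustrative.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

# Random input of shape (batch_size, time, features).
inputs = tf.random.uniform(shape=(1, 2, 3))

# A single Dense layer whose kernel is shared by both code paths.
dense_layer = layers.Dense(units=4, use_bias=False)

# Path 1: call the Dense layer directly on the rank-3 input.
direct = dense_layer(inputs)

# Path 2: wrap the same layer so it is applied to each time step.
wrapped = layers.TimeDistributed(dense_layer)(inputs)

same = np.allclose(direct.numpy(), wrapped.numpy(), atol=1e-5)
print(tuple(direct.shape), tuple(wrapped.shape), same)
```

TimeDistributed becomes genuinely useful when the wrapped layer does not handle the extra time axis on its own, as in the video example above.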

In Practice

Great! Now we know how to use the Keras Dense layer correctly. So let's experiment with this layer to understand it more intimately. We'll keep things simple and use a synthetic dataset created using scikit-learn's make_classification. The dataset will have ten features and ten classes. We deliberately leave the features unnormalized to see how well a neural network can extract features without any help.
The dataset will remain constant with 10000 samples. We'll train our model for 100 epochs with a batch size of 256.
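A dataset with these dimensions could be generated roughly as follows. This is a sketch, not the notebook's exact call: the values of n_informative, n_redundant, and random_state are assumptions. One detail worth knowing is that make_classification requires n_classes * n_clusters_per_class <= 2 ** n_informative, so the default n_informative=2 is not enough for ten classes.

```python
import numpy as np
from sklearn.datasets import make_classification

# 10,000 samples, 10 features, 10 classes, as described above.
# n_informative=8, n_redundant=2, and random_state=42 are assumed
# values chosen for this sketch.
X, y = make_classification(
    n_samples=10_000,
    n_features=10,
    n_informative=8,
    n_redundant=2,
    n_classes=10,
    random_state=42,
)

print(X.shape, y.shape)   # (10000, 10) (10000,)
print(np.unique(y).size)  # 10
```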

Check out the Colab Notebook →

Our fully connected neural network looks like this:
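from tensorflow.keras import layers, models

# `configs` is a dict of hyperparameters that is assumed to be
# defined elsewhere (for example, in the accompanying notebook).
def MLPModel():
    inputs = layers.Input(shape=(10,))
    x = layers.Dense(configs["hidden_units"], activation="relu")(inputs)
    outputs = layers.Dense(10, activation="softmax")(x)

    return models.Model(inputs, outputs)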

Effect of Parameters - units

The first argument of the Dense layer is units, which controls the number of neurons in that layer. We will try 10, 100, and 1000 units. The panels below show the result of using different units.
Clearly, increasing the number of units (and therefore the number of parameters) improves the model.
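To see how quickly the parameter count grows with units, the sketch below rebuilds the same two-layer architecture with the hidden size passed as an argument. The helper build_mlp is a hypothetical stand-in for MLPModel that avoids the configs dict.

```python
from tensorflow.keras import layers, models


def build_mlp(hidden_units):
    """Same architecture as MLPModel, with the hidden size as an argument."""
    inputs = layers.Input(shape=(10,))
    x = layers.Dense(hidden_units, activation="relu")(inputs)
    outputs = layers.Dense(10, activation="softmax")(x)
    return models.Model(inputs, outputs)


# Hidden layer: 10 * units weights + units biases.
# Output layer: units * 10 weights + 10 biases.
param_counts = {
    units: build_mlp(units).count_params() for units in (10, 100, 1000)
}
print(param_counts)  # {10: 220, 100: 2110, 1000: 21010}
```

Each tenfold increase in units gives roughly a tenfold increase in trainable parameters for this model.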


If you look at the training and validation loss, the one with 1000 units starts to overfit quickly, while the one with only 10 neurons has not even started to diverge.



Effect of Kernel Regularization

Can we counter this overfitting without lowering the capacity of the model? Enter kernel regularization, which lets you apply penalties on a layer's parameters during optimization. We will not go into the details; instead, we'll see the effect in action.
We will compare the model trained with 1000 hidden units without any regularization against models trained with the same configuration but with a kernel_regularizer. We will use the L2 regularization penalty, which takes an l2 argument. It's a non-negative float, typically a small value between 0 and 1. A value of 0 means no regularization.
In the previous section, we noticed that the model trained without regularization started overfitting soon. Let's see the effect of different l2 values and regularization in general.
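As a small illustration of what the penalty looks like, the sketch below attaches an L2 regularizer to a Dense layer with a constant kernel and reads the resulting loss term from the layer. The constant kernel and the l2 value of 0.1 are chosen here only so that the number is easy to check by hand.

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

# Dense layer with a constant kernel of 0.5 and an L2 penalty on the kernel.
dense_layer = layers.Dense(
    units=4,
    use_bias=False,
    kernel_initializer=tf.keras.initializers.Constant(value=0.5),
    kernel_regularizer=regularizers.L2(l2=0.1),
)

# Call the layer once so that the (3, 4) kernel gets created.
_ = dense_layer(tf.ones(shape=(1, 3)))

# The penalty is l2 * sum(kernel ** 2) = 0.1 * 12 * 0.25, about 0.3.
penalty = float(dense_layer.losses[0])
print(penalty)
```

During training, Keras adds this term to the main loss, which pushes the weights toward smaller values.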


Notice the divergence between loss and val_loss for the no-reg run. It has the most divergence indicating overfitting, while the reg-0.1 run diverged the least. How does regularization impact the final accuracy of the model?


L2 regularization with an l2 value of 0.1 didn't help the model achieve the same level of val_acc, even though it helped counter overfitting. However, because a regularized model overfits less, it can be trained for a longer time. In practice, we usually regularize with an l2 value of 1e-3.

Conclusion

This quick report shows how a fully connected layer can be initialized and created using Keras. We also did some comparative studies to understand the usage even better.
The Dense layer is a foundational layer in deep learning and is usually used in attention layers, MLP blocks, projectors, etc. With dense connections, the number of parameters is high, making it non-ideal for directly processing high-dimensional inputs like images.
If you have a more complicated use case, consider using the EinsumDense layer instead. It uses einsum expressions, and Dense is a special case of EinsumDense. We will cover this in another report shortly!