The Reality Behind the Optimization of Imaginary Variables

Exploring the capabilities of neural networks to map the imaginary loss landscape and studying their applications in modern research.
Darshan Deshpande

Introduction

Complex numbers have been around for a long time, but despite their appealing properties and their potential for enabling completely new and richer representations through the imaginary domain, complex-valued neural networks have remained scarce, largely because popular frameworks like PyTorch and TensorFlow lack basic complex layer infrastructure. The primary motivation of this article is to inspire and encourage readers and researchers in the field to work on this heavily unexplored domain. Let's get started!

So what makes the complex domain so interesting?

An elegant pattern created by (x+iy)^3 in a polar plane
A complex number, compared to a real number, is far more versatile because it carries more information: it spans both the real and imaginary axes of the complex plane, yet supports almost all of the mathematical operations that can be performed on real numbers. In addition, complex functions can be differentiated, subject to conditions on their analytic (holomorphic) nature that we will discuss later in this article.
Complex numbers have applications in a wide range of fields, including but not limited to signal processing, electrodynamics, fluid dynamics, and quantum mechanics. Complex-valued neural networks in geoscience [3], MRI reconstruction [4], speech separation [5], and image processing [6] have shown noteworthy improvements over their real-valued counterparts. Work on image processing for edge devices, such as [7], has shown that complex representations can match or outperform binarized architectures on ImageNet.
But before we can discuss the details of these papers, there are some primary questions that need to be answered. Namely: how do we train a neural net on these numbers? How does it perform the necessary backpropagation steps on complex vectors? And do all complex functions have at least a first-order derivative?
Let's explore the conditions under which a complex function is differentiable and how frameworks such as PyTorch and TensorFlow calculate complex gradients.

Cauchy-Riemann Equations and Wirtinger's Derivative

1. Importance of Holomorphic functions

Every complex function can be separated into real and imaginary parts as shown below:
f(z) = f(x+iy) = u(x,y) + iv(x,y)
Here, f is a complex function with a complex mapping of the form \mathbb{C} \rightarrow \mathbb{C} and x, y, u(x, y), v(x,y) \in \mathbb R.
For a complex function to be differentiable everywhere in the complex plane, it has to be holomorphic. In complex analysis, the words analytic and holomorphic are used interchangeably; in a nutshell, a holomorphic function is one that is complex-differentiable in a neighborhood of every point of its domain.
(Figure: a few of the directions from which a point z_0 in the complex plane can be approached. Credits: complex-analysis.com)
Since there are infinitely many directions from which we can approach a point z_0 in the complex plane (a few are shown above), the derivative at z_0 is only defined if the limit of the difference quotient is the same along every one of these paths. This is a far stronger requirement than in the real case, where a point can only be approached from the left or the right.
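To make this concrete, the derivative of f at a point z_0 is defined by the same limit as in the real case, except that the increment h is now complex and can approach zero from any direction:
\begin{aligned}f'(z_0) = \lim_{h \to 0} \frac{f(z_0 + h) - f(z_0)}{h}\end{aligned}
For f(z) = \overline z, approaching along the real axis (h = t) gives \frac{\overline{z_0 + t} - \overline{z_0}}{t} = 1, while approaching along the imaginary axis (h = it) gives \frac{\overline{z_0 + it} - \overline{z_0}}{it} = \frac{-it}{it} = -1. The two limits disagree, so \overline z is nowhere complex-differentiable.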
The prerequisite for a complex function to be differentiable is that it satisfies the Cauchy-Riemann conditions. Let us take a closer look at those next.

2. Cauchy-Riemann Conditions

For a function f(z) = u(x,y) + iv(x,y) defined on an open subset of the complex plane, the derivative of f with respect to z depends on the partial derivatives of f with respect to both of its real components. Since x = \frac{z + \bar z}{2} and y = \frac{z - \bar z}{2i}, we have \frac{\partial x}{\partial z} = \frac{1}{2} and \frac{\partial y}{\partial z} = -\frac{i}{2}, and the chain rule gives:
\begin{aligned} \frac{\partial f}{\partial z} = \frac{\partial f}{\partial x}\frac{\partial x}{\partial z} +\frac{\partial f}{\partial y}\frac{\partial y}{\partial z}\\= \frac{1}{2}(\frac{\partial f}{\partial x} - i\frac{\partial f}{\partial y})\\ = \frac{1}{2}[(\frac{\partial u}{\partial x} + i\frac{\partial v}{\partial x}) -i (\frac{\partial u}{\partial y} + i\frac{\partial v}{\partial y})]\\ =\frac{1}{2}[(\frac{\partial u}{\partial x} + i\frac{\partial v}{\partial x}) + (-i\frac{\partial u}{\partial y} +\frac{\partial v}{\partial y})] \end{aligned}
Now, approaching the point along the real axis (x-axis), the variation in y vanishes, so \frac {\partial f}{\partial y} = 0 and:
\begin{aligned}\frac{\partial f}{\partial z} = \frac{1}{2}(\frac{\partial u}{\partial x} +i \frac{\partial v}{\partial x})\end{aligned}
Similarly, approaching along the imaginary axis (y-axis), \frac {\partial f}{\partial x} = 0 and:
\begin{aligned}\frac{\partial f}{\partial z} = \frac{1}{2}(-i\frac{\partial u}{\partial y} + \frac{\partial v}{\partial y})\end{aligned}
Thus, by equating the two expressions for \frac{\partial f}{\partial z} above (the derivative must be the same regardless of the direction of approach), we get the Cauchy-Riemann conditions:
\begin{aligned}\frac{\partial u}{\partial x} = \frac{\partial v}{\partial y}\\\frac{\partial u}{\partial y} = -\frac{\partial v}{\partial x}\end{aligned}
A complex function must satisfy these conditions to be complex-differentiable.
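As a quick check, consider f(z) = z^2 = (x^2 - y^2) + i\,(2xy), so that u = x^2 - y^2 and v = 2xy:
\begin{aligned}\frac{\partial u}{\partial x} = 2x = \frac{\partial v}{\partial y}\\\frac{\partial u}{\partial y} = -2y = -\frac{\partial v}{\partial x}\end{aligned}
Both conditions hold everywhere, so z^2 is holomorphic. For f(z) = \overline z (u = x, v = -y), on the other hand, \frac{\partial u}{\partial x} = 1 while \frac{\partial v}{\partial y} = -1, which is consistent with the failing directional limits computed earlier.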
Although this approach can be used to compute the derivatives needed for backpropagation, the major issue is that restricting ourselves to holomorphic functions is of little use for neural networks: holomorphic functions are hard to come by in practice, and in particular any non-constant real-valued loss of a complex variable fails to be holomorphic (if v = 0, the Cauchy-Riemann conditions force u to be constant). This made the approach unreliable for backpropagation, until Wirtinger's observation.

3. Wirtinger's derivative for backpropagation

Wirtinger noticed that even if f(z) is not holomorphic, it can be re-written as f(z,\overline z) (where \overline z is the conjugate of z) and treated as a function of the two formally independent variables z and \overline z. As long as f is differentiable with respect to its real components x and y, the partial derivatives \frac{\partial}{\partial z} and \frac{\partial}{\partial\bar z} are well defined. These operators can be expressed through the real and imaginary parts as:
\begin{aligned} \frac{\partial}{\partial x} = \frac{\partial z}{\partial x}\frac{\partial }{\partial z} +\frac{\partial \bar z}{\partial x}\frac{\partial }{\partial \bar z}\\=\frac{\partial}{\partial z} + \frac{\partial}{\partial \bar z}\end{aligned}
\begin{aligned} \frac{\partial}{\partial y} = \frac{\partial z}{\partial y}\frac{\partial }{\partial z} +\frac{\partial \bar z}{\partial y}\frac{\partial }{\partial \bar z}\\=i(\frac{\partial}{\partial z} - \frac{\partial}{\partial \bar z})\end{aligned}
Solving these two equations for \frac{\partial}{\partial z} and \frac{\partial}{\partial \bar z} gives:
\begin{aligned}\frac{\partial}{\partial z}=\frac{1}{2}(\frac{\partial}{\partial x} - i\frac{\partial}{\partial y})\\\frac{\partial}{\partial \bar z}=\frac{1}{2}(\frac{\partial}{\partial x} + i\frac{\partial}{\partial y})\end{aligned}
For a holomorphic f, the Cauchy-Riemann conditions imply f_x = -if_y, so:
\begin{aligned}\frac{\partial }{\partial z} = \frac{1}{2} (\frac{\partial }{\partial x} - i\frac{\partial }{\partial y}) = \frac{1}{2} (-i\frac{\partial }{\partial y} - i\frac{\partial }{\partial y}) = -i\frac{\partial }{\partial y}\end{aligned}
and
\begin{aligned}\frac{\partial }{\partial \overline z} = \frac{1}{2} (\frac{\partial }{\partial x} + i\frac{\partial }{\partial y}) = \frac{1}{2} (-i\frac{\partial }{\partial y} + i\frac{\partial }{\partial y}) = 0\end{aligned}
In other words, \frac{\partial f}{\partial \bar z} vanishes for holomorphic functions, so it is precisely the quantity that captures how a non-holomorphic, real-valued loss changes with its complex argument. This means that for complex-valued optimization with a real-valued objective we work with \frac{\partial L}{\partial \bar z} rather than \frac{\partial L}{\partial z}. It is worth mentioning that working with \frac{\partial f}{\partial z} is also possible (this is the convention JAX follows), but since this tutorial uses TensorFlow, which computes the derivative with respect to the complex conjugate, we will stick to \frac{\partial f}{\partial \bar z}. A short worked example below makes this concrete, after which we will see how it extends to complex-variable optimization.
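Consider the squared magnitude L(z) = |z|^2 = z\overline z. It is real-valued and therefore not holomorphic, yet its Wirtinger derivatives are perfectly well defined. Treating z and \overline z as independent variables:
\begin{aligned}\frac{\partial L}{\partial z} = \overline z,\qquad \frac{\partial L}{\partial \overline z} = z\end{aligned}
The conjugate derivative \frac{\partial L}{\partial \overline z} = z points radially outward, i.e. in the direction of steepest increase of |z|^2, so a step against it, z \leftarrow z - \lambda z, shrinks the magnitude, which is exactly what a descent update should do.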

4. Performing Gradient Descent

To start out, let us have a look at how backpropagation works for real variables for a simple regression example so that we can extend the logic to the complex domain. The standard method for a forward pass for real-valued weights is given as:
\begin{aligned}z = Wx^T+b\\ \hat y = \sigma(z)\end{aligned}
Here, \sigma represents an activation function. Applying the chain rule, the change in loss caused by a small change in the weights, \frac{\partial L}{\partial W}, is given by:
\begin{aligned}\frac{\partial L}{\partial W} = \frac{\partial L}{\partial \hat y} \frac{\partial \hat y}{\partial z} \frac{\partial z}{\partial W}\end{aligned}
Similarly, the change in loss caused by a small change in the bias, \frac{\partial L}{\partial b}, is given by:
\begin{aligned}\frac{\partial L}{\partial b} = \frac{\partial L}{\partial \hat y} \frac{\partial \hat y}{\partial z} \frac{\partial z}{\partial b}\end{aligned}
These gradients can be easily backpropagated as follows:
\begin{aligned}W = W - \lambda * \frac{\partial L}{\partial W}\\b = b - \lambda * \frac{\partial L}{\partial b}\\ \lambda \ is \ the \ learning\ rate \end{aligned}
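As a minimal sketch of this update loop (the variable names and toy data below are our own illustrative choices, not taken from the article's notebook), a one-dimensional linear regression with a linear activation looks like this in TensorFlow:
import tensorflow as tf

# Toy data for y ~ 3x with a little noise (illustrative values only)
x = tf.random.uniform((100, 1), dtype=tf.float64)
y = 3.0 * x + tf.random.normal((100, 1), stddev=0.1, dtype=tf.float64)

# Real-valued weight, bias and learning rate (lambda in the equations above)
W = tf.Variable(tf.zeros((1, 1), dtype=tf.float64))
b = tf.Variable(tf.zeros((1,), dtype=tf.float64))
lr = 0.1

for epoch in range(50):
    with tf.GradientTape() as tape:
        y_pred = tf.matmul(x, W) + b                 # z = Wx + b, linear activation
        loss = tf.reduce_mean((y - y_pred) ** 2)     # mean squared error
    dW, db = tape.gradient(loss, [W, b])             # dL/dW and dL/db via the chain rule
    W.assign(W - lr * dW)                            # W = W - lambda * dL/dW
    b.assign(b - lr * db)                            # b = b - lambda * dL/db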
The equations above describe the backpropagation update for a single node in a layer. When dealing with complex numbers, the loss is not holomorphic in W, so W cannot be differentiated directly; instead, it is differentiated with respect to its real and imaginary components, W_u and W_v, as shown in the previous sub-section. The real-valued update therefore only needs minor adjustments to become compatible with the complex domain.
\begin{aligned}W_u = W_u - \lambda * \frac{\partial L}{\partial W_u}\\W_v = W_v - \lambda * \frac{\partial L}{\partial W_v}\end{aligned}
Now, the update for W = W_u + iW_v can be re-written as:
\begin{aligned}W = (W_u - \lambda *\frac{\partial L}{\partial W_u}) + i\,(W_v - \lambda * \frac{\partial L}{\partial W_v})\\= W - \lambda * (\frac{\partial L}{\partial W_u} + i\frac{\partial L}{\partial W_v})\end{aligned}
Using\ Wirtinger's\ Equation:\\[0.1in]\begin{aligned}W=W - \lambda * \frac{\partial L}{\partial \overline W}\end{aligned}
Note that the constant factor relating \small \frac{\partial L}{\partial W_u} + i\frac{\partial L}{\partial W_v} to \small \frac{\partial L}{\partial \overline W} in Wirtinger's equation is absorbed into \small \lambda, so it simply rescales the learning rate. This equation gives us our desired update step for a real-valued loss function. A more in-depth analysis of its behaviour can be found in [1]. The most remarkable aspect of this method is that popular frameworks such as PyTorch and TensorFlow already use it to calculate their complex gradients. But if that is the case, the question remains: why aren't complex neural networks more common?
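As a quick, hedged sanity check of this claim (this snippet is ours, not part of the article's notebook), we can ask tf.GradientTape for the gradient of the real-valued loss |z|^2 with respect to a complex variable. From the worked example earlier, \frac{\partial L}{\partial \overline z} = z, so the returned gradient should be proportional to z, up to the constant factor and conjugation convention the framework chooses, and the update derived above should shrink the loss:
import tensorflow as tf

# A complex variable and the learning rate (lambda)
z = tf.Variable(tf.constant(3.0 + 4.0j, dtype=tf.complex128))
lr = 0.1

with tf.GradientTape() as tape:
    loss = tf.math.real(z * tf.math.conj(z))   # L = |z|^2, a real-valued objective

# Gradient of the real-valued loss w.r.t. the complex variable
grad = tape.gradient(loss, z)
print(grad.numpy())   # expected to be proportional to dL/dz_bar = z

# The update rule from this section: z <- z - lambda * dL/dz_bar
z.assign(z - lr * grad)
print(tf.math.real(z * tf.math.conj(z)).numpy())   # |z|^2 should have decreased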

Problems with current implementations

Despite their effectiveness, advanced audio and signal-processing networks are limited by their inability to manipulate the complex domain directly. Most neural networks trained for audio modulation or feature extraction only see real-valued data in the form of magnitude STFTs (spectrograms) or wavelet features. These time-frequency representations discard the phase of the original signal, and accurately reconstructing the signal without it is extremely difficult.
The most common way of handling complex numbers in the dominant frameworks today is to use dual kernels (weights) for each operation, one for the real axis and one for the imaginary axis. It should be noted that this is a very different approach from the one discussed in the previous section and creates a completely different loss landscape. The method involves splitting the real and imaginary parts of the input tensor and stacking them, or processing them through separate network branches. For example, the output of a simple STFT has shape (frames, frequency bins). This is a complex tensor whose magnitude and phase angle encode the amplitude and phase of the signal. Splitting and stacking its real and imaginary parts changes the shape to (2, frames, frequency bins), as sketched below.
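A small sketch of this split-and-stack representation (the dummy signal, frame length and frame step below are arbitrary illustrative values):
import tensorflow as tf

# One second of dummy audio at 16 kHz
signal = tf.random.normal([16000])

# Complex STFT: shape (frames, frequency_bins), dtype complex64
stft = tf.signal.stft(signal, frame_length=512, frame_step=256)
print(stft.shape, stft.dtype)

# Split-kernel style input: real and imaginary parts stacked along a new axis,
# giving shape (2, frames, frequency_bins) with a real dtype
stacked = tf.stack([tf.math.real(stft), tf.math.imag(stft)], axis=0)
print(stacked.shape, stacked.dtype)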
As the dimensionality of the feature space increases, the number of samples required for proper optimization grows exponentially. This problem, popularly known as the Curse of Dimensionality, is especially pronounced when complex numbers are optimized through split-kernel methods. Direct complex-variable manipulation sidesteps the issue by avoiding the extra dimension in the first place. In addition, most operations on real-valued functions extend naturally to complex values, so the code only needs minimal changes to achieve better optimization and generalization with considerably fewer training parameters.
Let us see a code example of how this can be done with the TensorFlow API.

Diving into the Code (Colab notebook)

To show how efficient the complex-variable optimization method can be, we will compare the speed and generalizability of the two implementations on a simple mapping f(z): z \rightarrow 5z corrupted with random noise. Let's code a complex-valued regression example for this:
We start off by importing the tensorflow and random libraries. All mathematical operations will be done using TensorFlow ops, while the random module will help add noise to our data and prevent overfitting. Optionally, you can set seeds for both of them.
import tensorflow as tf
import random

# Setting seeds (Optional)
random.seed(42)
tf.random.set_seed(42)
Next, we define the hyperparameters like learning rate and the number of samples that we want to train on. Although it is possible to implement a custom function for Mean Squared Error, for the sake of simplicity, we will use the one offered by tf.keras since it supports complex variables out of the box.
# HYPERPARAMETERS
NUM_SAMPLES = 1000
lr = 0.1
mse_loss = tf.keras.losses.MeanSquaredError()
Now we will create a helper function that produces NUM_SAMPLES number of dummy linear data points that satisfy the mapping function f(z): z \rightarrow 5z. Since this is an easy equation to fit, we add some random noise to every sample to make sure that the model doesn't overfit easily.
# Helper function to create dummy data
def create_dummy_data(num_samples):
    # f(x): x -> 5x
    x = tf.expand_dims(tf.constant([complex(i/num_samples, i/num_samples) for i in range(num_samples)], tf.complex128), -1)
    y = tf.expand_dims(tf.constant([complex(5*(j/num_samples) + random.random(), 5*(j/num_samples) + random.random()) for j in range(num_samples)], tf.complex128), -1)
    # Shape (1000, 1), (1000, 1)
    return x, y
Next up, we define a helper function that takes in the dummy train and validation datasets that we will create later and trains a simple regression model on them. The dataset is shuffled at the start of every epoch to help generalization. Notice that we use 128-bit complex numbers for our calculations because of the scale of the original data; you could experiment with a 64-bit version as well, but more precision is generally better when training with complex variables. The forward pass is a simple matrix multiplication. Note that we do not use a bias tensor here since the task is pure coefficient scaling and does not require it. We then compute the real-valued loss and calculate the Wirtinger derivative (\frac{\partial L}{\partial \overline W}) using TensorFlow's GradientTape. Using this gradient, we perform backpropagation by updating the weights, which marks the end of the epoch. This process is repeated for 30-40 epochs.
# Helper function for complex optimization
def train_complex(train_x, train_y, test_x, test_y, w):
    for i in range(31):
        # Shuffling at the start of every epoch
        indices = tf.range(start=0, limit=train_x.shape[0], dtype=tf.int32)
        shuffled_indices = tf.random.shuffle(indices)
        train_x = tf.gather(train_x, shuffled_indices)
        train_y = tf.gather(train_y, shuffled_indices)
        with tf.GradientTape() as tape:
            # Get y_pred. Linear activation is used for this example
            y_pred = tf.matmul(train_x, tf.cast(w, tf.complex128))
            # Get real valued loss
            mse = mse_loss(train_y, y_pred)
        if i % 10 == 0:
            val_loss = mse_loss(test_y, tf.matmul(test_x, tf.cast(w, tf.complex128)))
            print(f"Training Loss at epoch {i}: {tf.abs(mse).numpy()}, Validation Loss: {tf.abs(val_loss).numpy()}")
        # Get gradients using Wirtinger's derivative
        dL_dwbar = tape.gradient(mse, w)
        # Apply raw backprop
        w.assign(w - dL_dwbar * lr)
    print("Training finished")
We now define the helper function for real-valued training, based on the split-kernel approach discussed earlier. We initialize two real-valued weights and perform the matrix multiplications on the real and imaginary parts separately. The resulting loss is the sum of the two component losses. Note that this time the training loop runs for 60-70 epochs; in our experiments, that is roughly how long the weights take to fully converge.
# Helper function for real optimization (split kernel approach)
def train_real(train_x, train_y, val_x, val_y, w_real, w_imag):
    # Splitting validation data into real and imaginary parts
    val_x_real, val_x_imag = tf.math.real(val_x), tf.math.imag(val_x)
    val_y_real, val_y_imag = tf.math.real(val_y), tf.math.imag(val_y)
    for i in range(61):
        # Shuffling at the start of every epoch
        indices = tf.range(start=0, limit=train_x.shape[0], dtype=tf.int32)
        shuffled_indices = tf.random.shuffle(indices)
        train_x = tf.gather(train_x, shuffled_indices)
        train_y = tf.gather(train_y, shuffled_indices)
        # Splitting real and imaginary parts from shuffled data
        x_real, x_imag = tf.math.real(train_x), tf.math.imag(train_x)
        y_real, y_imag = tf.math.real(train_y), tf.math.imag(train_y)
        with tf.GradientTape() as tape1, tf.GradientTape() as tape2:
            # Get y_pred for real and imaginary parts separately
            y_pred_real = tf.matmul(x_real, tf.cast(w_real, tf.float64))
            y_pred_imag = tf.matmul(x_imag, tf.cast(w_imag, tf.float64))
            # Calculate real valued losses
            mse_real = mse_loss(y_real, y_pred_real)
            mse_imag = mse_loss(y_imag, y_pred_imag)
        if i % 10 == 0:
            val_pred_y_real = tf.matmul(val_x_real, tf.cast(w_real, tf.float64))
            val_pred_y_imag = tf.matmul(val_x_imag, tf.cast(w_imag, tf.float64))
            val_loss = mse_loss(val_y_real, val_pred_y_real) + mse_loss(val_y_imag, val_pred_y_imag)
            print(f"Training Loss at epoch {i}: {mse_real.numpy() + mse_imag.numpy()}, Validation Loss: {val_loss.numpy()}")
        # Get separate gradients
        dL_dw_r = tape1.gradient(mse_real, w_real)
        dL_dw_i = tape2.gradient(mse_imag, w_imag)
        # Apply raw backprop on both components
        w_real.assign(w_real - dL_dw_r * lr)
        w_imag.assign(w_imag - dL_dw_i * lr)
    print("Training finished")
Finally, we create the dummy data and weight variables for both cases and begin training. The training and validation loss is printed every 10 epochs to help visualize the descent.
# Creating dummy data
x, y = create_dummy_data(NUM_SAMPLES)

# Train-validation split (80%-20%)
train_x, val_x, train_y, val_y = x[:int(.8*NUM_SAMPLES), :], x[int(.8*NUM_SAMPLES):, :], y[:int(.8*NUM_SAMPLES), :], y[int(.8*NUM_SAMPLES):, :]

# Initializing the complex weight
w = tf.Variable(tf.zeros((1, 1), tf.complex128))

# Training complex optimization example
train_complex(train_x, train_y, val_x, val_y, w)

# Initializing two separate kernels for the real and imaginary domains
w_real, w_imag = tf.Variable(tf.zeros((1, 1), tf.float64)), tf.Variable(tf.zeros((1, 1), tf.float64))

# Training real optimization example
train_real(train_x, train_y, val_x, val_y, w_real, w_imag)
Below is the graph comparing the two methods of optimization:
Some noteworthy inferences from the analysis of these training graphs are:
  1. Using complex variables allows much faster optimization while matching or beating the real-valued approach in terms of loss in most cases. The complex optimization run converges in roughly half the number of epochs of its real-valued counterpart.
  2. Complex-variable optimization generalizes better, as reflected in the small but consistent gap in validation-set MSE: 0.16 for the complex-valued model versus 0.17 for the real-valued one.
  3. This method of training requires at least 2 \times fewer trainable parameters to converge, and this reduction in parameter count is a major contributor to the faster training.

Conclusion

Machine learning in prominent research fields such as quantum mechanics and signal processing, which rely heavily on complex numbers, is still in its infancy. Complex-domain models can assist discoveries and future exploration in several areas, such as image and text representations, in addition to the ones listed above. In this article, we summarise the applications of complex numbers in the field of artificial intelligence and include a small TensorFlow implementation example to demonstrate their effectiveness. More advanced examples will be covered in the next part of this series, where we will dive deeper into \mathbb C \rightarrow \mathbb C and \mathbb C \rightarrow \mathbb R convergence, alterations to activation functions and initializers, and the implementation of sub-classed tf.keras layers that support complex variables.

References

  1. The Complex Gradient Operator and the CR-Calculus (Ken Kreutz-Delgado, 2009).
  2. Deep Complex Networks (Chiheb Trabelsi et al., 2017).
  3. Complex-valued neural networks for machine learning on non-stationary physical data (Jesper Sören Dramsch, Mikael Lüthje, Anders Nymark Christensen, 2020).
  4. Analysis of Deep Complex-Valued Convolutional Neural Networks for MRI Reconstruction (Elizabeth K. Cole, Joseph Y. Cheng, John M. Pauly, Shreyas S. Vasanawala, 2020).
  5. Complex Neural Spatial Filter: Enhancing Multi-channel Target Speech Separation in Complex Domain (Rongzhi Gu, Shi-Xiong Zhang, Yuexian Zou, Dong Yu, 2014).
  6. A Complex-Valued VGG Network Based Deep Learning Algorithm for Image Recognition (S. Gu and L. Ding, 2018).
  7. BCNN: Binary Complex Neural Network (Yanfei Li et al., 2021).