The Reality Behind Optimization of Imaginary Variables - II

Discovering the types of complex-valued neural networks, their initialization techniques, activations and convergence factors for efficient complex variable optimization.
Darshan Deshpande
To read part one of this series, please head here.

Introduction

In this article, we will examine the impact of complex representations and the operations applied to them by comparing them to their real-valued counterparts and showing where they improve on the latter. The fundamental purpose of this report is to promote the use and study of imaginary representations by demonstrating how readily the mathematics applies and how much domain flexibility is available.
In the previous report, we discussed the mathematics governing complex variable optimization and how, in some cases, complex optimization can be advantageous, providing faster convergence than linearly stacked real-valued layers. In this section, we'll look at how this optimization technique can be extended to more practical operations such as convolutions, and explore the influence of network initialization procedures, activations and loss functions on complex convergence.

The Importance of Phase

All supervised machine learning problems have the same objective: to find a function that maps distributions. The contemporary definition of this mapping involves training a set of weights that can align the input to a certain real-valued output distribution. So, why bother introducing a completely new dimension of phase when the amplitude (representing the energy) of a signal or pixel is adequate for this mapping? There are several reasons for doing so.
The introduction of phase, for example, helps capture the temporal position of a wave at any given time when processing signals. This information not only helps us compare signals or images, but also underpins the intelligibility of most speech signals in use today (Shi et al., 2006). In image processing, it has been found that the phase of an image alone is sufficient to capture its texture and saturation and to reconstruct its structure and orientation (Oppenheim and Lim, 1981). In many circumstances, the rotational structure that phase gives to complex numbers aids faster optimization by reducing the degrees of freedom (Akira Hirose, 2013).
As seen in the figure above, instead of the (at least) two linear transformations required with a real-valued mapping, a complex vector mapping can be accomplished with a single rotation by adjusting the phasor angle. This reduces both the degrees of freedom and the computational cost.
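As a small numeric illustration (not from the original article; the values are arbitrary), rotating a point in the complex plane is a single multiplication by a unit phasor, whereas the equivalent real-valued mapping needs a full 2 x 2 rotation matrix:

import numpy as np

z = 1 + 1j                          # input point
phasor = np.exp(1j * np.pi / 2)     # rotate by 90 degrees
print(z * phasor)                   # (-1+1j): one complex multiplication

# The equivalent real-valued mapping needs a 2x2 rotation matrix
R = np.array([[0., -1.],
              [1., 0.]])
print(R @ np.array([1., 1.]))       # [-1.  1.]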

Traditional Complex Neural Networks

In traditional models, a complex neuron was implemented much like a real-valued neuron, with the only difference being the way the matrix multiplication was handled. This approach was widely used because most frameworks did not support complex gradients. Complex multiplication is not the same as real vector multiplication: it changes not only the magnitude but also the phasor angle. If we instead manipulate complex representations using real operations, for example linearly mapping an input x \in \mathbb R^2 to an output y \in \mathbb R^2, we require a weight matrix W \in \mathbb R^{2 \times 2}. This is problematic because the linear model now has more trainable parameters than inputs or outputs, which makes the network underdetermined (Akira Hirose, 2013). Hirose et al. proposed that, whenever necessary, the complex weight W \in \mathbb C^{M \times N} can be represented as a real-valued matrix:
W = \begin{bmatrix} W_r & -W_i \\ W_i & W_r \end{bmatrix}
This matrix form was useful for expressing complex calculations with real operations and remained necessary until Wirtinger calculus found its way into contemporary libraries. Building on this adaptation, Chiheb Trabelsi et al. proposed complex weight initialization techniques to allow for better convergence. These initialization techniques are discussed later in this article.
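A quick sanity check of this block-matrix representation, sketched in NumPy: multiplying the real block matrix with the stacked real and imaginary parts of the input reproduces the complex product Wx (the sizes M and N are arbitrary):

import numpy as np

M, N = 3, 2
W = np.random.randn(M, N) + 1j * np.random.randn(M, N)
x = np.random.randn(N) + 1j * np.random.randn(N)

# Real block-matrix form of the complex weight
W_block = np.block([[W.real, -W.imag],
                    [W.imag,  W.real]])          # shape (2M, 2N)
x_stacked = np.concatenate([x.real, x.imag])     # shape (2N,)

y_block = W_block @ x_stacked
y_complex = W @ x
print(np.allclose(y_block[:M], y_complex.real))  # True
print(np.allclose(y_block[M:], y_complex.imag))  # True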

The Fundamentals of \mathbb C \to \mathbb C and \mathbb C \to \mathbb R convergence

The interconversion between the real and complex domains is a major convenience when working with complex representations of real data. Common machine learning tasks such as classification require generating probability distributions over the ground-truth labels, which calls for a \mathbb C \to \mathbb R mapping when working with complex neurons. This is where the interchangeability of imaginary numbers comes into play: any complex vector can be converted into a real-valued vector by taking its magnitude.
z_{mag} = |a+ib| \in \mathbb R
This real-valued representation can be passed through a standard non-linearity like the sigmoid function to generate the required label probabilities.
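As a minimal sketch of this \mathbb C \to \mathbb R readout (the tensor z and its values are purely illustrative):

import tensorflow as tf

z = tf.constant([1.0 + 2.0j, -0.5 + 0.1j], dtype=tf.complex64)  # complex logits
z_mag = tf.abs(z)          # |a + ib|, now a real-valued tensor
probs = tf.sigmoid(z_mag)  # standard real non-linearity for label probabilities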
This direction of the cast is the familiar one, and most people are aware of how and why it occurs. The real question arises when we convert a real vector to a complex representation (\mathbb R \to \mathbb C). This can be approached in one of two ways.
The first is a vector projection utilizing Fourier transformations in which the amplitudes or energies of a signal or image are projected onto the frequency domain.
\int_{-\infty}^\infty f(x)e^{-2\pi ikx}dx
Fourier transformations play a significant role in signal analyses and are useful for obtaining complex-valued features from image and speech data.
Another method, as shown by Chiheb Trabelsi et al., is to let the model learn this conversion. The basic idea is to lift the real vector into the complex domain by setting its imaginary part to zero, and then propagate this newly formed vector through a number of complex convolution and batch normalization layers that output the appropriate complex features.
x \to x+0j \to CNN + BN \; layers \to Complex \; features
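A minimal sketch of the zero-imaginary lift (the tensor name x is illustrative; the complex convolution and batch-normalization layers that would follow are the ones developed later in this article):

import tensorflow as tf

x = tf.random.uniform((1, 28, 28, 1))        # real-valued input batch
x_complex = tf.complex(x, tf.zeros_like(x))  # x -> x + 0j, dtype complex64
# x_complex would now be passed through complex Conv + BN layers
# that learn the appropriate complex features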
While the second method has been effective for certain image processing tasks, it has not been tested across all domains, so the FFT method remains the ubiquitous choice for such transformations.
Now that we know the difference between \mathbb C \to \mathbb C and \mathbb C \to \mathbb R optimization and techniques to perform this mapping, we are ready to explore the layers that build these networks. But before we can dive into those, we should discuss two types of kernel implementations which are actively researched: strictly linear and widely linear networks.

Strictly Linear vs Widely Linear Networks

A strictly linear, or complex linear, network is the well-established approach of performing complex matrix multiplication in the forward pass and using the Wirtinger derivative to backpropagate a real-valued loss. This is what we discussed in the first part of the series, and it can be represented as:
\hat{y} = \sigma(Wx + b), \quad W, b \in \mathbb{C}, \; \sigma \text{ is a complex activation function}
This means that if the operation is carried out using real-valued operations, as described in the previous section, the parameter count doubles (N \to 2N).
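As an illustration, a strictly linear dense layer might be sketched as follows. This is not code from the original article: the class name ComplexDense is mine, and it reuses the complex_glorot initializer implemented later in the Initialization Techniques section.

import tensorflow as tf

class ComplexDense(tf.keras.layers.Layer):
    # Minimal sketch of a strictly linear (complex) dense layer: y = Wx + b
    def __init__(self, units, **kwargs):
        super().__init__(dtype=tf.complex64, **kwargs)
        self.units = units

    def build(self, input_shape):
        self.w = self.add_weight(shape=(input_shape[-1], self.units),
                                 initializer=complex_glorot, trainable=True)
        self.b = self.add_weight(shape=(self.units,),
                                 initializer=complex_glorot, trainable=True)

    def call(self, x):
        # A single complex weight matrix and a complex bias
        return tf.matmul(x, self.w) + self.b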
Widely linear networks, on the other hand, are special projections of complex layers that are used for specific tasks. The primary distinction between a strictly linear and a widely linear network is that the former is holomorphic while the latter is not, which gives the latter the special advantage of fitting second-order non-circular data (Zeyang Yu, Shengxi Li and Danilo Mandic, 2019).
A complex random variable Z is said to be circular if, for any rotation angle \phi, both Z and Ze^{i\phi} have the same probability distribution. This means that second-order circular signals are invariant to phase transformations (Danilo Mandic and Vanessa Su Lee Goh, 2009). In practice, such circular distributions are rare and most distributions are non-circular, which is why widely linear neural networks were suggested as a practical solution.
A widely linear network uses conjugates of the imaginary inputs along with two weight matrices, one for the standard and the other for the conjugate representation of the input.
\hat{y} = \sigma(W_1 x + W_2 \overline{x} + b), \quad W_1, W_2, b \in \mathbb{C}, \; \sigma \text{ is a complex activation function}
Compared to strictly linear layers, widely linear layers introduce an additional complex weight matrix; relative to the real-valued baseline, this quadruples the number of trainable parameters (N \to 4N).
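Extending the ComplexDense sketch above, a widely linear dense layer simply adds a second complex weight matrix acting on the conjugated input (again, the class name is mine and this is only an illustrative sketch):

class WidelyLinearDense(tf.keras.layers.Layer):
    # Sketch of a widely linear layer: y = W1 x + W2 conj(x) + b
    def __init__(self, units, **kwargs):
        super().__init__(dtype=tf.complex64, **kwargs)
        self.units = units

    def build(self, input_shape):
        self.w1 = self.add_weight(shape=(input_shape[-1], self.units),
                                  initializer=complex_glorot, trainable=True)
        self.w2 = self.add_weight(shape=(input_shape[-1], self.units),
                                  initializer=complex_glorot, trainable=True)
        self.b = self.add_weight(shape=(self.units,),
                                 initializer=complex_glorot, trainable=True)

    def call(self, x):
        # The conjugate path lets the layer fit second-order non-circular data
        return tf.matmul(x, self.w1) + tf.matmul(tf.math.conj(x), self.w2) + self.b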
Despite advances in widely linear networks, the applicability of the Cauchy-Riemann equations and the lower computational cost of strictly linear networks keep the latter the more popular choice.
In the following sections, we'll explore how to implement certain routinely used layers, initializers, activations, and losses involving complex variables.

Initialization Techniques

The initialization methods for complex neural networks differ greatly from those used for standard neural networks. To understand how the phase is incorporated into the weight distribution, we will recap how real-valued initializers work, focusing on two of the most commonly used techniques: He (He et al., 2015) and Glorot (Xavier Glorot, Yoshua Bengio, 2010) initialization.
Both initializers sample weights from a zero-mean normal distribution with the following standard deviations:
\sigma_{He} = \sqrt{\frac{2}{fan_{in}}}
\sigma_{Glorot} = \sqrt{\frac{2}{fan_{in} + fan_{out}}}
To translate this initialization to the complex domain, we will need to ensure that the phase is adequately incorporated.
In general, complex weight initializations are expected to have a polar form as below:
W = |W|e^{i\theta} = |W|\cos\theta + i|W|\sin\theta
Here, \theta and |W| are the phase and magnitude of the complex weight matrix respectively.
To ensure that the generated distribution has the correct standard deviation, we must first determine its variance, from which the deviation follows.
The variance of W is the expectation of its squared magnitude minus the square of its expected value:
Var(W) = \mathbb{E}[|W|^2] - (\mathbb{E}[W])^2
Here, we don't know the value of Var(W) but we know the value of Var(|W|) because the value of |W| (magnitude of the weight vector) follows the Rayleigh distribution (Chi-distribution with two degrees of freedom) (Chiheb Trabelsi et al, 2018).
Var(|W|) = \mathbb{E}[|W|^2] - \mathbb{E}[|W|]^2
Because the weights are symmetrically distributed around zero, \mathbb{E}[W] = 0 and hence Var(W) = \mathbb{E}[|W|^2]. Combining this with the equation above, we get:
Var(W) = Var(|W|) + (\mathbb{E}[|W|])^2
Here, both the variance and expectation of the magnitude are only dependent on the Rayleigh distribution's mode (\sigma).
E[|W|] = \sigma \sqrt{\frac{\pi}{2}} \;\;(Mean)
Var(|W|) = \frac{4-\pi}{2} \;\sigma^2
Hence,
Var(W) = \frac{4-\pi}{2} \;\sigma^2 + (\sigma \sqrt{\frac{\pi}{2}})^2 = 2 \sigma^2
In this case, if we want to implement the initialization proposed by He et al., then Var(W) = 2/fan_{in} which means that the value of \sigma must be set to 1/\sqrt{fan_{in}}.
Similarly for the initialization proposed by Xavier Glorot and Yoshua Bengio, Var(W) = 2/(fan_{in} + fan_{out}). This means that \sigma = 1/\sqrt{fan_{in} + fan_{out}}
We can see that the variance of W only depends on the magnitude and not the phase. To initialize the phase, we sample a uniform distribution between -\pi and \pi. To get the final weights, we simply multiply the magnitude by the phasor as
W = |W|\cos\theta + i|W|\sin\theta
Let us analyze the code to see how this can be implemented in Tensorflow. Note that this implementation is similar to the code proposed by Chiheb Trabelsi et al., with one difference: they use the real-matrix approach whereas we use complex weights directly.
import math
import tensorflow as tf
import tensorflow_probability as tfp


def compute_fans(shape):
    receptive_field_size = 1
    # Calculate the receptive field for depth channels
    for dim in shape[:-2]:
        receptive_field_size *= dim
    fan_in = shape[-2] * receptive_field_size
    fan_out = shape[-1] * receptive_field_size
    return int(fan_in), int(fan_out)


def complex_glorot(shape, dtype=None):
    # Compute the input and output nodes
    fans_in, fans_out = compute_fans(shape)
    # Rayleigh scale for the Glorot criterion: sigma = 1 / sqrt(fan_in + fan_out)
    s = 1. / math.sqrt(fans_in + fans_out)
    # Sample the magnitude |W| from a Rayleigh distribution with that scale
    modulus = tfp.random.rayleigh(scale=s, shape=shape)
    # Sample the phase uniformly from [-pi, pi]
    phase = tf.random.uniform(shape, -math.pi, math.pi)
    # Generate the weights by multiplying the magnitude with the phasor.
    # This is defined in the equation as |W|e^(i x theta) where
    # e^(i x theta) = cos theta + i sin theta
    w_real = modulus * tf.math.cos(phase)
    w_imag = modulus * tf.math.sin(phase)
    # Create a complex weight tensor from the real and imaginary parts
    weights = tf.complex(w_real, w_imag)
    return weights

Convolutional Layers

Because of their versatility, convolutions are frequently employed across machine learning for feature extraction. To speed up the operation, Fast Fourier Transforms (FFTs) and inverse FFTs are often used in modern libraries to accelerate real-valued convolutions with large kernels.
IFFT(FFT(W) \cdot FFT(x))
This is possible because convolution in the time domain is equivalent to point-wise multiplication in the frequency domain.
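A small NumPy check of this equivalence for a 1-D signal (point-wise multiplication of DFTs corresponds to circular convolution in the time domain):

import numpy as np

x = np.random.randn(8)
w = np.random.randn(8)

# Point-wise multiplication in the frequency domain ...
freq = np.fft.ifft(np.fft.fft(w) * np.fft.fft(x)).real

# ... equals circular convolution in the time domain
circ = np.array([sum(w[k] * x[(n - k) % 8] for k in range(8)) for n in range(8)])
print(np.allclose(freq, circ))  # True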
For a strided, padded convolution, the output spatial size is given by the following formula:
n_{output} = \lfloor{\frac{ n_{input} + 2p -k}{s}}\rfloor + 1
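For example, a 28 \times 28 input convolved with a k = 3 kernel, padding p = 1 and stride s = 1 gives \lfloor (28 + 2 - 3)/1 \rfloor + 1 = 28, so the spatial size is preserved.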
To extend this operation to the complex domain, we need to ensure that this output shape is preserved by the complex convolution. Chiheb Trabelsi et al. (2018) showed that a complex convolution can be built by combining discrete real-valued convolutions over the real and imaginary parts of the tensor.
Source: Deep Complex Networks (Chiheb Trabelsi et al, 2018)
This is made possible by the distributive nature of the convolution operation. With a kernel W = a + ib and an input x + iy:
W * (x + iy) = (a*x - b*y) + i(b*x + a*y)
This convolution strategy is chosen because it avoids having to implement the operation from scratch for every accelerator. Chiheb Trabelsi et al. perform a single convolution per batch by leveraging the orthogonal real-matrix approach, but for clarity we will use four discrete convolution operations. The idea remains the same in both cases.
Let's take a look at how this operation is implemented in Tensorflow.
class ComplexConv2D(tf.keras.layers.Layer):
    def __init__(self, filters, kernel_size=(5, 5), strides=1, padding='same',
                 data_format='channels_last', dilation_rate=1,
                 kernel_initializer='glorot_uniform', use_bias=True, **kwargs):
        super().__init__(dtype=tf.complex64, **kwargs)
        self.filters = filters
        self.kernel_size = (kernel_size, kernel_size) if isinstance(kernel_size, int) else kernel_size
        self.strides = strides
        self.padding = padding
        self.dilation_rate = dilation_rate
        self.use_bias = use_bias
        self.kernel_initializer = kernel_initializer

    def build(self, input_shape):
        # Complex kernel of shape (kh, kw, in_channels, filters)
        self.kernel = self.add_weight(shape=self.kernel_size + (input_shape[-1], self.filters),
                                      initializer=self.kernel_initializer,
                                      trainable=True)
        if self.use_bias:
            # Bias shaped to the spatial output (valid for stride 1 and 'same' padding)
            self.bias = self.add_weight(shape=(input_shape[-3], input_shape[-2], self.filters),
                                        initializer=self.kernel_initializer,
                                        trainable=True)

    def call(self, input):
        real, imag = tf.math.real(input), tf.math.imag(input)
        kernel_r, kernel_i = tf.math.real(self.kernel), tf.math.imag(self.kernel)
        # Four real-valued convolutions implementing (a + ib) * (x + iy)
        real_real = tf.nn.conv2d(real, kernel_r, self.strides, self.padding.upper(), dilations=self.dilation_rate)
        real_imag = tf.nn.conv2d(real, kernel_i, self.strides, self.padding.upper(), dilations=self.dilation_rate)
        imag_real = tf.nn.conv2d(imag, kernel_r, self.strides, self.padding.upper(), dilations=self.dilation_rate)
        imag_imag = tf.nn.conv2d(imag, kernel_i, self.strides, self.padding.upper(), dilations=self.dilation_rate)
        output = tf.complex(real_real - imag_imag, real_imag + imag_real)
        if self.use_bias:
            return output + self.bias
        return output
Compared to a real-valued convolutional layer, the number of convolution operations quadruples in the above implementation. This can easily be reduced to a single operation by following the paper's implementation, as mentioned before. In my experiments, this method of convolution was more robust to overfitting than convolving over a doubled-channel real tensor or two separate real tensors.

Activations

Non-linearities must be introduced for the network to learn non-linear mappings. Activations are not required to be holomorphic, and restricting ourselves to strictly holomorphic functions can limit the pool of candidate activations; however, backpropagation through holomorphic functions is generally less computationally expensive, which is a major reason many researchers prefer them. The only restriction on the choice of a complex activation function is that it be differentiable with respect to both its real and imaginary components.
Split real-imaginary hyperbolic tangent function (Andy M. Sarrof, 2018)
Let us see two implementations of ReLU for complex optimization.
1. CReLU
This is the standard split-activation approach, where the ReLU function is applied independently to the real and imaginary parts. CReLU satisfies the Cauchy-Riemann equations when \theta_z \in ]0, \pi/2[ or \theta_z \in ]\pi, 3\pi/2[.
class CReLU(tf.keras.layers.Layer):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)

    def call(self, input):
        return tf.complex(tf.nn.relu(tf.math.real(input)),
                          tf.nn.relu(tf.math.imag(input)))
This method of applying non-linearities is the most widely used due to its fast computation and easier implementation using pre-existing libraries.
2. modReLU
This version of ReLU was proposed by Arjovsky et al. (2015) and introduces a dead zone around the origin of the complex plane in which the neuron is inactive.
ReLU(|z| + b)e^{i\theta} = \left\{ \begin{array}{ll} (|z|+b)\frac{z}{|z|} & \quad |z|+b \geq 0 \\ 0 & \quad otherwise \end{array} \right.
Here, b (bias) is a learnable parameter that defines the dead zone. This design was proposed because applying ReLU separately to both components performed poorly on simple tasks. modReLU keeps the phase information (\theta_z) unaltered, since changing it would corrupt the complex representation. The drawback of this non-linearity is that it is non-holomorphic, which makes it more computationally expensive and unsuitable for some specific complex mapping operations. The implementation captures the essence of ReLU and adds only a single trainable parameter to determine the inactive region (much like Parametric ReLU, He et al., 2015).
\sigma_{modReLU}(z) = \sigma_{ReLU}(|z| + b)\frac{z}{|z|}
class ModReLU(tf.keras.layers.Layer):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)

    def build(self, input_shape):
        # Learnable bias b that defines the dead zone around the origin
        self.b = self.add_weight(shape=input_shape[1:],
                                 initializer=tf.keras.initializers.zeros,
                                 dtype=tf.float32)

    def call(self, input):
        mag = tf.abs(input)
        relu = tf.nn.relu(mag + self.b)
        # Rescale the input so that its phase is preserved
        return tf.cast(relu, tf.complex64) * (input / tf.cast(mag, tf.complex64))

Losses and Metrics

Standard losses such as Mean Squared Error work by default with complex networks but depending on the nature of the convergence, it may be more convenient to optimize on a real-valued loss function. For complex convergence, as explained in the previous part of this series, we will discuss two types of regression losses.

Mean Squared Error

The standard Mean Squared Error for the complex domain can be represented as follows:
MSE = \frac{1}{N}\sum_{i=1}^N (y_i- \hat{y}_i)^2
Here, y and \hat{y} \in \mathbb C. Expanding this complex-valued error for a single node, we get
(y_1 - y_2)^2 = [(a_1+ib_1) - (a_2+ib_2)]^2 \\ = [(a_1-a_2)+i(b_1-b_2)]^2 \\ = (a_1-a_2)^2 - (b_1-b_2)^2 + 2i(a_1-a_2)(b_1-b_2)
Note that the outcome here is a complex-valued loss, which cannot be minimized directly (there is no natural ordering on \mathbb C) and is costlier to compute. To address this, we multiply the error by its conjugate instead of explicitly squaring it, as seen below:
(y_1 - y_2)\,\overline{(y_1 - y_2)} = [(a_1-a_2)+i(b_1-b_2)]\,[(a_1-a_2)-i(b_1-b_2)] \\ = (a_1-a_2)^2 - \cancel{i(a_1-a_2)(b_1-b_2)} + \cancel{i(a_1-a_2)(b_1-b_2)} + (b_1-b_2)^2 \\ = (a_1-a_2)^2 + (b_1-b_2)^2
This is the more canonical representation of the MSE that we are used to seeing. The use of a conjugate-based loss function gives a real error value which can then be optimized using the standard Wirtinger's derivative approach. Let's look at the code for this loss:
def complex_mse(y_true, y_pred):
    difference = y_true - y_pred
    return tf.abs(tf.reduce_mean(difference * tf.math.conj(difference)))

Log error

The conjugate method discussed above can be extended to the log error as well (Savitha et al., 2013).
Log \;error = \sum_{i=1}^N (\log(y_i)-\log(\hat{y}_i)) \, \overline{(\log(y_i)-\log(\hat{y}_i))}
If y and \hat{y} are expressed in their polar forms as y = re^{i\phi} and \hat{y} = \hat{r}e^{i\hat{\phi}}, the error function above boils down to a monotonically decreasing function
Log \; error = \frac{1}{2}\sum_{i=1}^N \left( \log\!\left[\frac{\hat{r}_i}{r_i}\right]^2 + [\hat{\phi}_i - \phi_i]^2 \right)
def complex_log_error(y_true, y_pred):
    difference = tf.math.log(y_true) - tf.math.log(y_pred)
    return tf.abs(tf.reduce_mean(difference * tf.math.conj(difference)))
Apart from these two complex regression losses, commonplace classification tasks can be handled by taking the absolute value of the final layer's output to obtain a real representation, which can then be trained with standard cross-entropy or a similar loss.
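A minimal sketch of such a classification head (the function name and layer sizes are illustrative, not from the article):

import tensorflow as tf

def classification_logits(complex_features, num_classes=10):
    # C -> R: the absolute value of the complex features is real-valued
    real_features = tf.abs(complex_features)
    # A standard real-valued dense head produces the class logits
    return tf.keras.layers.Dense(num_classes)(real_features)

# The logits can then be trained with the usual cross-entropy, e.g.
# tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)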

Experimenting with Complex Networks (Colab notebook)

Now that we have seen the types of networks and layer implementations using Tensorflow, let us work on a simple example to see how these techniques are applied.
We will follow a step-by-step approach for coding a complex-valued image denoising autoencoder.

Imports

import tensorflow as tf
from tensorflow.keras.layers import Input, Dense, Activation, Lambda, Add
import tensorflow_probability as tfp
import matplotlib.pyplot as plt
import numpy as np
import wandb
import math
Apart from the usual imports needed for implementing models in Tensorflow, we will need the tensorflow-probability package, which lets us sample the weight magnitudes from the Rayleigh distribution during initialization.

Dataset

We will be using the MNIST dataset for denoising. The dataset consists of grayscale images of handwritten digits along with their corresponding labels; for denoising, we can safely ignore the labels.
def get_mnist():
    """Retrieve the MNIST dataset and process the data."""
    # Get the data
    (x_train, _), (x_test, _) = tf.keras.datasets.mnist.load_data()
    x_train = x_train.astype('float32')
    x_test = x_test.astype('float32')
    x_train /= 255
    x_test /= 255
    # Add Gaussian noise scaled by a factor of 0.2
    noise_factor = 0.2
    x_train_noisy = x_train + noise_factor * tf.random.normal(shape=x_train.shape)
    x_test_noisy = x_test + noise_factor * tf.random.normal(shape=x_test.shape)
    # Clip the noisy images back into the [0, 1] range
    x_train_noisy = tf.clip_by_value(x_train_noisy, clip_value_min=0., clip_value_max=1.)
    x_test_noisy = tf.clip_by_value(x_test_noisy, clip_value_min=0., clip_value_max=1.)
    # Add a channel dimension
    x_train_noisy, x_train, x_test_noisy, x_test = tf.expand_dims(x_train_noisy, -1), tf.expand_dims(x_train, -1), tf.expand_dims(x_test_noisy, -1), tf.expand_dims(x_test, -1)
    # Project every image onto the frequency domain with a 2D FFT
    return tf.signal.fft2d(tf.cast(x_train_noisy, tf.complex64)), tf.signal.fft2d(tf.cast(x_train, tf.complex64)), tf.signal.fft2d(tf.cast(x_test_noisy, tf.complex64)), tf.signal.fft2d(tf.cast(x_test, tf.complex64))

x_train_noisy, x_train, x_test_noisy, x_test = get_mnist()
To create our training and testing sets, we must first apply Gaussian noise to our images. We do this by sampling noise from a normal distribution and adding it to the images, scaled by a factor of 0.2 to limit its intensity and avoid saturating the images with random pixels. Because of this addition, some pixel values fall outside the allowed range of [0, 1], so we clip the noisy images back to that range using tf.clip_by_value.
This clipped data is then projected onto the frequency domain using a two-dimensional Fourier transform, for both the noisy input and the clean target. Using this representation allows us to use our complex layers and the conjugate-based regression losses.

Defining the Architecture

The architecture used for denoising is a standard encoder-decoder with skip connections. The skip connections help gradients flow through the network during training, thereby stabilizing the descent.
def get_model():
    inp = Input((28, 28, 1), dtype=tf.complex64)
    # Encoder
    cl1 = ComplexConv2D(16, 3, 1, 'same', kernel_initializer=complex_glorot)(inp)
    cl1 = CReLU()(cl1)
    cl2 = ComplexConv2D(32, 3, 1, 'same', kernel_initializer=complex_glorot)(cl1)
    cl2 = CReLU()(cl2)
    cl3 = ComplexConv2D(64, 3, 1, 'same', kernel_initializer=complex_glorot)(cl2)
    cl3 = CReLU()(cl3)
    # Decoder with residual connections
    cl4 = ComplexConv2D(64, 3, 1, 'same', kernel_initializer=complex_glorot)(cl3)
    cl4 = CReLU()(cl4)
    cl4 = Add()([cl3, cl4])
    cl5 = ComplexConv2D(32, 3, 1, 'same', kernel_initializer=complex_glorot)(cl4)
    cl5 = CReLU()(cl5)
    cl5 = Add()([cl2, cl5])
    cl6 = ComplexConv2D(16, 3, 1, 'same', kernel_initializer=complex_glorot)(cl5)
    cl6 = CReLU()(cl6)
    cl6 = Add()([cl1, cl6])
    out = ComplexConv2D(1, 3, 1, 'same', kernel_initializer=complex_glorot)(cl6)
    out = Add()([inp, out])
    out = ComplexConv2D(1, 1, 1, 'same', kernel_initializer=complex_glorot)(out)
    return tf.keras.models.Model(inp, out)
Each two-dimensional convolution uses a filter size of 3 and a stride of 1, preserves the input shape, and is followed by a CReLU activation. The architecture is small, with three convolutional layers in the encoder and three in the decoder.

Creating a Custom W&B Callback For Logging

Since we will be logging the denoised output to W&B after every epoch, we will implement a custom Keras callback that handles this.
class WandBCallback(tf.keras.callbacks.Callback):
    def __init__(self, project='complex-optimization', run_name='complex-denoising'):
        super().__init__()
        wandb.init(project=project, name=run_name)
        self.inp = tf.expand_dims(x_test_noisy[0], 0)
        self.out = x_test[0]

    def on_epoch_end(self, epoch, logs):
        pred = tf.squeeze(self.model(self.inp), 0)
        # Move everything back to the spatial domain with an inverse FFT
        ifft = tf.squeeze(tf.abs(tf.signal.ifft2d(pred)), -1)
        input_img = tf.squeeze(tf.math.abs(tf.signal.ifft2d(tf.squeeze(self.inp, 0))), -1)
        output_img = tf.squeeze(tf.math.abs(tf.signal.ifft2d(self.out)), -1)
        plt.figure(figsize=[20, 4.5])
        plt.subplot(1, 3, 1)
        plt.imshow(input_img, cmap='gray')
        plt.title("Noisy")
        plt.subplot(1, 3, 2)
        plt.imshow(ifft, cmap='gray')
        plt.title("Denoised")
        plt.subplot(1, 3, 3)
        plt.imshow(output_img, cmap='gray')
        plt.title("Original")
        wandb.log({"images": plt, "loss": logs['loss'], "epoch": epoch + 1})
        plt.show()

    def on_train_end(self, logs):
        wandb.finish(0)
Here, we subclass the keras.callbacks.Callback class. The custom callback takes the project and run names as input, which are used to initialize a new W&B project and run respectively. The purpose of this class is to log the matplotlib plot and the training loss after every epoch, which can then be visualized through the W&B interface.

Training

The training code is the same as for any other Keras model, with one difference: we specify the conjugate version of the Mean Squared Error, since this is a \mathbb C \to \mathbb C type of convergence.
model = get_model()
model.compile(optimizer=tf.keras.optimizers.Adam(0.001), loss=complex_mse)
history = model.fit(x_train_noisy, x_train, batch_size=1024, epochs=30, callbacks=[WandBCallback()])
We will train the model for 30 epochs with a batch size of 1024. Optionally, you can also experiment with learning rate schedules to aid faster convergence, as sketched below.
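For instance, an exponential-decay schedule could be swapped in for the fixed learning rate; it reuses the model and complex_mse defined above, and the decay values below are arbitrary choices for this sketch:

lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-3,  # same starting rate as the fixed setting above
    decay_steps=500,             # arbitrary values chosen for this sketch
    decay_rate=0.9)
model.compile(optimizer=tf.keras.optimizers.Adam(lr_schedule), loss=complex_mse)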

Results

The network sufficiently converges in 30 epochs for us to analyze its outputs. Let us do that next. You can see the outputs of all the epochs with the help of the slider in the following panel grid.
We can immediately see that the network successfully captures and suppresses the noise around the digit. In my experiments with different model architectures, most complex-valued models were quite robust, and no immediate overfitting was seen with up to 8 layers in the encoder and decoder respectively.

Conclusion

In this part we studied the practicalities of complex optimization, touching upon the importance of phase and its relation to complex convergence. We discussed strictly linear and widely linear networks, contrasted traditional weight initialization with contemporary initialization techniques, and went in depth on how to extend pre-existing real-valued operations and layers to simulate complex operations. Finally, we explored an image denoising example that leveraged complex representations and operations.
Active research in the field continues to expand our understanding of the subject. Researchers have successfully applied complex representations in a variety of domains, from candidate replacements for the attention mechanism in Transformers (Lee-Thorp et al., 2021) to improved audio enhancement techniques (Andy M. Sarrof, 2018).
This article is by no means comprehensive; rather, it seeks to give a gist of what remains undiscovered by the general populace interested in Machine Learning. I hope it helped you understand the importance of complex representations and how they can be used. If you have any questions or queries then feel free to reach out to me on Twitter and I will be happy to address them!

References

On the importance of phase in human speech recognition (Guangji Shi, M. M. Shanechi and P. Aarabi, 2006).
The importance of phase in signals (A. V. Oppenheim and J. S. Lim, 1981).
Complex-Valued Neural Networks: Advances and Applications (Akira Hirose, 2013).
Deep Complex Networks (Chiheb Trabelsi et al., 2018).
Widely Linear Complex-valued Autoencoder: Dealing with Noncircularity in Generative-Discriminative Models (Zeyang Yu, Shengxi Li and Danilo Mandic, 2019).
Complex Valued Nonlinear Adaptive Filters (Danilo Mandic, Vanessa Su Lee Goh, 2009).
On the Circularity of a Complex Random Variable (Esa Ollila, 2008).
On circularity (Bernard Picinbono, 1994).
Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification (He et al., 2015).
Understanding the difficulty of training deep feedforward neural networks (Xavier Glorot, Yoshua Bengio, 2010).
Complex Neural Networks For Audio (Andy M. Sarrof, 2018).
Unitary Evolution Recurrent Neural Networks (Arjovsky et al., 2015).
Projection-Based Fast Learning Fully Complex-Valued Relaxation Neural Network (Savitha et al., 2013).
FNet: Mixing Tokens with Fourier Transforms (Lee-Thorp et al., 2021).