
Audio source separation



Project details

MVA 2023, ENS Paris-Saclay | Deep and Signal processing | Code

Context

Audio source separation aims to recover several signals from a mixed signal.
Deep neural networks can perform this task by learning how to separate sources from training data in a supervised way: sources are synthetically combined (by addition), so the ground truth is known.
The following report describes our experiments on a source separation problem with a specific dataset:
  • The problem is restricted to two sources;
  • One source is the voice of a woman reading Jules Verne, referred to in the following as the "clean" signal;
  • The other source is the sound of a street, in which we can hear passers-by talking in the background, footsteps, cars... It will be called the "noisy" signal.
Although we call it an audio separation problem, it closely resembles a denoising problem with a specific type of additive noise (whereas traditional denoising assumes additive white Gaussian noise).

[W&B run set panel: 1 run]

Note that there is a much more general field of study for audio source separation which does not assume the number or nature of the sources.


Original dataset analysis

Initial dataset study and limitations

The dataset provided for the project contains mixed signals at given SNRs. The mixed signals have all been normalized to have the same power.
Thus, as we can see in the figures below, both the clean signal and the noise signal are amplified and then summed.
At first, we used MSE as the training loss, which is not robust to scaling. We observed that our models were overfitting, which led us to question the task: we were comparing the network output amplitude to the clean one, while the clean signal only appears up to a scaling constant in the mixed signal. The easiest way for the network to reduce the loss is therefore to learn the training set amplitudes (by learning a mean amplitude? phonemes? we can only guess at this point).


These figures show the variety in the noise, the similarities in the clean signal, and the effect of the SNR.
The power constraint also limits augmentations: scaling the input signal at a given SNR yields the same mixed signal, and thus the same input for the network. Furthermore, cropping the mixed signal (which is of interest because we want networks that are robust to the signal length) modifies the power and therefore shifts the distribution.
We could have worked around these issues by:
  • using SI-SNR as the loss, which is scale invariant (a sketch of such a loss is given below). However, the output amplitude is then not constrained and there is no certainty about what the network will predict; an amplification can be added at the end of the network to stay in a reasonable order of magnitude;
  • using scale-robust methods, such as Deep Clustering, where the output of the network is a binary spectrogram mask (and the supervision comes from masks).
To overcome this issue, we had to change the initial problem (see section "The curse of MSE loss + constant power mix").
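For reference, here is a minimal sketch of a scale-invariant SNR (SI-SNR) loss in PyTorch; the function name and defaults are our own illustration, not taken from our codebase:

```python
import torch


def si_snr_loss(estimate: torch.Tensor, target: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Negative scale-invariant SNR averaged over the batch.

    estimate, target: waveforms of shape (batch, time).
    """
    # Remove the mean so the measure is invariant to DC offsets
    estimate = estimate - estimate.mean(dim=-1, keepdim=True)
    target = target - target.mean(dim=-1, keepdim=True)
    # Project the estimate onto the target: the target is rescaled optimally,
    # which is what makes the metric scale invariant
    dot = (estimate * target).sum(dim=-1, keepdim=True)
    target_energy = target.pow(2).sum(dim=-1, keepdim=True) + eps
    s_target = dot / target_energy * target
    e_noise = estimate - s_target
    si_snr = 10 * torch.log10(
        s_target.pow(2).sum(dim=-1) / (e_noise.pow(2).sum(dim=-1) + eps) + eps
    )
    return -si_snr.mean()  # maximize SI-SNR by minimizing its negative
```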

Further remarks

Outliers
The dataset also contains tracks that can be considered outliers, such as tracks with almost no voice or tracks with music in the clean signal. We decided to leave them in the dataset.
Simulation / Reality gap
This is a simplified model; there will necessarily be a gap when it is applied to natural signals. First caveat: we mix two audio sources which may have been captured with different microphones and which inherently contain some noise. Secondly, the mixing process is a linear combination and most probably neglects many physical effects. Last but not least, even assuming a perfect audio combination, we do not simulate the imperfections of microphone acquisition.

Changing the problem

The curse of MSE loss + constant power mix!

The only way any neural network model can minimize the MSE in this setup is to memorize the signal.
This is what happened in our first round of experiments: once the network had enough parameters, it memorized the training signals. The training charts below show that models with more than 1 million parameters (such as the 7-scale WaveUNet with 2.3M parameters, in red) start to memorize the dataset: the training loss keeps decreasing while the validation loss stops decreasing (fortunately without increasing).


[W&B run set panel: 5 runs]



Choice of a new dataset

We instead decided to work with a dataset mixed "live" (on the fly), where the power is not constrained and the SNR is drawn uniformly at random between a minimum and a maximum value.

These figures show that the remixed signals have amplitudes of the same order of magnitude as the initial signals.
The clean signal amplitude in the target data is the initial amplitude. If we add a scaling augmentation, we expect the network to predict the amplitude of the scaled signal. This is the desired behaviour, as we would like the output amplitude to be consistent with the input amplitude: if the speaker is "further away" (a less powerful signal), the denoising will preserve the speaker's signal power.
To ensure reproducibility and enable comparison between experiments, we implemented the dataset as follows (a sketch is given below):
  • For training, the SNR and the noise are chosen randomly;
  • For validation, only the SNR is chosen randomly, using a manually set seed so that the combinations are the same from one experiment to another.
By taking this decision, we accepted to re-train the previous experiment rounds (~75h of runs for the constant power mix + MSE) on the new dataset.
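The sketch below illustrates this on-the-fly mixing logic in PyTorch; class and attribute names are hypothetical and our actual implementation differs in its details:

```python
import torch
from torch.utils.data import Dataset


class LiveMixDataset(Dataset):
    """Sketch: mix a clean excerpt with a random noise excerpt at a random SNR."""

    def __init__(self, clean_tracks, noise_tracks, snr_range=(-4.0, 4.0), seed=None):
        self.clean_tracks = clean_tracks    # list of 1D waveform tensors
        self.noise_tracks = noise_tracks    # list of 1D waveform tensors
        self.snr_range = snr_range
        # A fixed seed (validation) makes the SNR/noise combinations reproducible
        self.rng = torch.Generator().manual_seed(seed) if seed is not None else None

    def __len__(self):
        return len(self.clean_tracks)

    def __getitem__(self, idx):
        clean = self.clean_tracks[idx]
        noise_idx = torch.randint(len(self.noise_tracks), (1,), generator=self.rng).item()
        noise = self.noise_tracks[noise_idx][..., : clean.shape[-1]]
        snr_min, snr_max = self.snr_range
        snr_db = snr_min + (snr_max - snr_min) * torch.rand(1, generator=self.rng).item()
        # Scale the noise to reach the target SNR; the clean amplitude is untouched,
        # so the power of the mix is no longer constrained
        clean_power = clean.pow(2).mean()
        noise_power = noise.pow(2).mean().clamp_min(1e-12)
        gain = torch.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
        mixed = clean + gain * noise
        return mixed, clean
```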

Training and evaluation framework

Training
Training is performed in a supervised fashion.
  • Training scripts allow training either locally or on Kaggle Notebooks directly from the command line (fully reproducible; the dataset is hosted on Kaggle to ease storage).
  • Experiments are fully versioned under git (an experiment ID tracks the model architecture as well as the training configuration - optimizer and dataloader settings).
  • Experiment results tracking is performed with Weights & Biases.
We validate the training mechanism in two ways.
  • pytest unit tests check the behavior of the training loop and the data pipeline;
  • training a single-scale convolutional model (FlatConv) with the L² loss (MSE) and the Adam optimizer (a minimal training step is sketched below).
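As an illustration, a minimal training step for this sanity check could look like the following simplified sketch; the actual scripts additionally handle configuration, logging and checkpoints:

```python
import torch
import torch.nn as nn


def train_one_epoch(model: nn.Module, train_loader, optimizer, device: str = "cuda") -> None:
    """One epoch of supervised training on (mixed, clean) waveform pairs."""
    criterion = nn.MSELoss()  # L2 loss
    model.train()
    for mixed, clean in train_loader:
        mixed, clean = mixed.to(device), clean.to(device)
        optimizer.zero_grad()
        prediction = model(mixed)           # estimated clean signal
        loss = criterion(prediction, clean)
        loss.backward()
        optimizer.step()


# Typical usage: optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```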
Validation
To evaluate our results, we implemented a GUI (graphical user interface) based on the interactive-pipe library, enabling navigation along the test signals, visualization of waveforms and playback of audio. The GUI facilitates live inference on the local GPU and even allows comparing various models.


Exploring architectures

FlatConvolutional "Baseline"

The baseline we implemented is called FlatConvolutional: it is a series of four convolutions with a given hidden dimension and activation functions, followed by two convolutions that reduce the channel dimension to the desired output.
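A minimal sketch of such a baseline is shown below; the exact kernel sizes and activations are illustrative assumptions, not the precise values of our implementation:

```python
import torch.nn as nn


class FlatConvolutional(nn.Module):
    """Sketch of the baseline: four convolutions at a fixed hidden width,
    then two convolutions reducing the channels to the desired output."""

    def __init__(self, ch_in: int = 1, ch_out: int = 1, hidden: int = 64, kernel: int = 9):
        super().__init__()
        pad = kernel // 2
        layers = [nn.Conv1d(ch_in, hidden, kernel, padding=pad), nn.LeakyReLU()]
        for _ in range(3):
            layers += [nn.Conv1d(hidden, hidden, kernel, padding=pad), nn.LeakyReLU()]
        layers += [
            nn.Conv1d(hidden, hidden // 2, kernel, padding=pad), nn.LeakyReLU(),
            nn.Conv1d(hidden // 2, ch_out, kernel, padding=pad),
        ]
        self.net = nn.Sequential(*layers)

    def forward(self, x):  # x: (batch, ch_in, time)
        return self.net(x)
```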

U-Net architecture

Two main architectures working in the time domain were proposed: ResUNet and WaveUNet.

ResUNet

See the article Res-U-Net: Unpaired Deep Cross-Modality Synthesis with Fast Training
The ResUNet is first and foremost based on a UNet architecture. The principle is, roughly, a series of convolutions, non-linearities and downsampling pooling to learn features at different scales, followed by a similar network with upsampling to map these features to the targeted output. The base layers can be enhanced with residual connections to ease learning. The specificity of the architecture is its skip connections: the outputs of the layers at each level of the "encoding phase" are copied and merged into the corresponding level of the "decoding phase", as illustrated in the image below taken from the article.
ResUNet (Unpaired Deep Cross-Modality Synthesis with Fast Training)
In our case, we re-implemented the architecture from scratch based on the architecture figure. We made the following choices (illustrated by the sketch after this list):
  • We fixed the downsampling factor to two and chose to extend the number of channels by a factor of 1.5 in the encoder phase, as opposed to the common practice in image processing of increasing the channels by a factor of 2. This choice was made to keep a compression ratio larger than 1: in image processing, the widely used setting is a downsampling with a kernel of size two (keeping one pixel out of four) combined with a doubling of the channels, giving a compression ratio of 2 (= 4/2); in 1D, downsampling by two only halves the number of samples, so doubling the channels would bring the compression ratio down to 1.
  • We used the residual convolution blocks described in the figure rather than plain convolutions.
  • The hyperparameters associated with this architecture are the channel extension factor (1.5 by default, as explained above), the size of the hidden dimension and the convolution kernel size.
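Here is a sketch of a residual block and one encoder level with the 1.5 channel extension; names and exact layer compositions are illustrative, not our exact code:

```python
import torch.nn as nn


class ResConvBlock(nn.Module):
    """Two 1D convolutions with a residual connection (illustrative version)."""

    def __init__(self, channels: int, kernel: int = 5):
        super().__init__()
        pad = kernel // 2
        self.body = nn.Sequential(
            nn.Conv1d(channels, channels, kernel, padding=pad),
            nn.LeakyReLU(),
            nn.Conv1d(channels, channels, kernel, padding=pad),
        )
        self.act = nn.LeakyReLU()

    def forward(self, x):
        return self.act(x + self.body(x))  # residual connection eases learning


class EncoderStage(nn.Module):
    """One encoder level: residual block, then downsample by 2 while
    extending the channel count by a factor of 1.5."""

    def __init__(self, ch_in: int, kernel: int = 5, extension: float = 1.5):
        super().__init__()
        self.block = ResConvBlock(ch_in, kernel)
        self.down = nn.Conv1d(ch_in, int(ch_in * extension), kernel_size=2, stride=2)

    def forward(self, x):
        skip = self.block(x)            # kept for the decoder skip connection
        return self.down(skip), skip
```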

Wave UNet

Wave-UNet: dedicated to audio separation
Wave UNet architecture
The architecture relies on the same "U-shape" as the ResUNet to learn from different scales of the signal. It differs in the choice of two different kernel sizes for downsampling (15) and upsampling (5). Concatenation in the skip connections becomes necessary, as summing is no longer possible. Here, the downsampling is a "simple" decimation (i.e. not combined with a linear or non-linear operation such as mean or max pooling).
  • We chose to allow dropout, i.e. to randomly zero activations with a given probability;
  • The hyperparameters associated with this architecture are the channel extension factor, the kernel sizes of the downsampling and upsampling convolutions, the number of layers and the dropout ratio.
A bias-free version of the WaveUNet has also been trained, leading to slightly better results. The bias-free trick has been used in image denoisers; the main point of discarding the bias is a better generalization to a wide range of noise levels (for instance, training on SNRs in [-4; 4] dB may generalize better to -8dB signals when there is no bias).
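The sketch below illustrates one down/up level with the kernel sizes mentioned above, plain decimation for downsampling, concatenated skip connections, and a bias flag for the bias-free variant; again an illustrative sketch, not our exact code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class WaveUNetDown(nn.Module):
    """One encoder level: convolution with kernel 15, then plain decimation."""

    def __init__(self, ch_in: int, ch_out: int, bias: bool = True):
        super().__init__()
        self.conv = nn.Conv1d(ch_in, ch_out, kernel_size=15, padding=7, bias=bias)

    def forward(self, x):
        feat = F.leaky_relu(self.conv(x))
        return feat[..., ::2], feat       # decimated output + skip connection


class WaveUNetUp(nn.Module):
    """One decoder level: upsample, concatenate the skip, convolution with kernel 5."""

    def __init__(self, ch_in: int, ch_skip: int, ch_out: int, bias: bool = True):
        super().__init__()
        self.conv = nn.Conv1d(ch_in + ch_skip, ch_out, kernel_size=5, padding=2, bias=bias)

    def forward(self, x, skip):
        x = F.interpolate(x, size=skip.shape[-1], mode="linear", align_corners=False)
        x = torch.cat([x, skip], dim=1)   # concatenation: channel counts differ
        return F.leaky_relu(self.conv(x))
```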

Evaluation

Evaluation was made at different levels:
  • The performance of the models was monitored during training on the train and validation datasets using Weights & Biases.
  • We built an interactive graphical user interface (GUI) using interactive-pipe to plot and play the initial signals and the predicted output. This is essential, as global metrics are not sufficient to describe the quality of the output in terms of intelligibility and denoising quality. The interface enables live inference while interactively switching between signals, models and SNRs.
  • We built a second interface using Plotly and Dash to visualize inference results in terms of input and output Signal-to-Noise Ratio (SNR) for various models and SNR ranges. It also allows navigating to the saved output signals in the project file tree.


Results

Here is the performance of the main models:

[W&B run set panel: 25 runs]


Impact of the chosen loss (L2 vs L1)

The L2 loss tends to produce over-smoothed results. Perceptually, using the L1 loss does not make a significant difference.
The L2 loss favors smooth signals, whereas the L1 loss seems to be more faithful to the original signal.


L1 loss seems to avoid oversmoothing - Animation when decreasing the SNR from 10 to -10 dB

L2 loss favors oversmoothed signals - Animation when decreasing the SNR from 10 to -10 dB





Quantitative analysis

We observed no overfitting, in contrast with the experiments made on the initial dataset (see section "The curse of MSE loss + constant power mix!"); this is probably linked to the high diversity of the dataset induced by the random choice of noises and SNRs.
We observe that the best performance in terms of average test SNR is obtained with the WaveUNet architecture trained bias-free (the LR scheduler does not bring much). We also observe that the WaveUNet and ResUNet architectures (cf. green and purple colours) have similar performance, and that the performance is roughly proportional to the logarithm of the number of parameters.

Please keep in mind that the MSE is far from being a perfect quality metric: a low MSE does not mean that the perceived quality is good. Moreover, the metric is averaged over the whole dataset, where many silent sections bias the results (note: we implemented a silence detector in our code for other reasons; it could have been used to give lower weights to the silent areas during evaluation).
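For clarity, the output SNR we refer to can be computed as in the following sketch (the function name is ours, for illustration):

```python
import torch


def output_snr_db(prediction: torch.Tensor, clean: torch.Tensor, eps: float = 1e-12) -> torch.Tensor:
    """SNR of the restored signal with respect to the clean reference, in dB."""
    signal_power = clean.pow(2).sum(dim=-1)
    error_power = (prediction - clean).pow(2).sum(dim=-1) + eps
    return 10 * torch.log10(signal_power / error_power)
```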



Qualitative analysis

Our best network at this point, a WaveUNet with 3.18M parameters, achieves an SNR of 15.1dB on the validation set, where the inputs are mixed with an SNR between -4dB and 4dB. Clearly, our network has not surpassed human hearing: even if we push the SNR down to -10dB, our brain is able to understand most of the text, whereas the source separation is not able to recover all the words in the signal. This may be due to our knowledge of the underlying language, which helps a bit.
We observe that the network is able to remove most of the noise correctly. Unfortunately, the voice sometimes sounds unnatural. This is due to the residual noise.
The use of a perceptual loss (a concept introduced in 2016 by Justin Johnson to improve image restoration quality) may push the network toward better perceptual quality.
Perceptual loss: in practice, further work could use the intermediate features/embeddings of a pretrained audio detection model so that the embeddings of the predicted and ground-truth signals are close to each other. This is equivalent to minimizing an L2 loss in a perceptual space; it requires a bit more memory at training time.
A GAN loss may achieve the same goal (but may be slightly more difficult and unstable to train).
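A possible sketch of such a perceptual loss, assuming a frozen pretrained audio encoder is available (`feature_extractor` is a placeholder, not a specific library model):

```python
import torch
import torch.nn.functional as F


def perceptual_loss(pred_audio: torch.Tensor, clean_audio: torch.Tensor, feature_extractor) -> torch.Tensor:
    """L2 distance between embeddings of the predicted and ground-truth signals.

    `feature_extractor` is assumed to be a frozen, differentiable audio model."""
    with torch.no_grad():
        target_features = feature_extractor(clean_audio)   # reference embedding, no gradient needed
    pred_features = feature_extractor(pred_audio)           # gradients flow back to the separator
    return F.mse_loss(pred_features, target_features)
```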
💡
We provide 4 samples at different input SNRs with 4 flavors of models (FlatConv, WaveUNet with and without bias, and finally the bias-free WaveUNet trained using the L1 loss).

[W&B run set panel: 4 runs]


Generalization capabilities:

We can push the input SNR outside the training distribution and see how the network behaves. It is quite impressive: the performance does not completely fall apart.
There is a limit though: our ear can still distinguish the voice where the network no longer can. This could probably be pushed further (either by re-training on a larger range of input SNRs or by using an even larger model).

The word "chien" disappears when we push the input SNR down to -10db. Using the MSE loss may favor forcing the signal to zero rather than letting the network hallucinate something plausible.

The generalization of our model is limited: it does not work correctly with other voices (it performs much worse with male voices), which is expected. This highlights the difficulty of building a generic audio denoising algorithm dedicated to voice isolation.


Conclusion

This project was a very interesting entry into the world of audio separation, even though we only tackled the tip of the iceberg by addressing a denoising problem (not the classic AWGN one) to restore a single voice.
We spent a fair amount of time evaluating the results and keeping the code clean rather than trying to improve metrics at all costs by searching for the biggest network.
We discovered that audio quality assessment is a topic as difficult as the algorithms that improve audio. One of our biggest contributions here is a solid framework for training and evaluating audio restoration models.
We did not find a way to fix the unnatural artifacts of the reconstructed voice; investing more time in tweaking the architecture or using a perceptual loss may lead to better results (but we would most probably end up facing the perception-distortion tradeoff).





Appendix

The name gyraudio

We first started the project with the idea of assisting audio separation (for footstep removal in chest-mounted action camera videos) using IMUs (inertial measurement units). We were able to retrieve synchronized audio & IMU tracks using GoPro Max GPMF metadata (similar to an EXIF track for video).
Before even trying to improve audio separation with IMU data, we first needed to solve the audio separation problem itself, which is what we did instead of pursuing the original topic.
💡


Code and training framework

  • MVA_pepites: a remote training framework for Kaggle; a simple command line gives access to 30 hours of GPU per week (limited to 12 hours per session), with the dataset hosted directly on Kaggle.