
Audio source separation


Project details




Context

Audio source separation aims to recover several signals from a mixed signal.
Deep neural networks can learn to perform this task from training data in a supervised way: sources are synthetically combined (by addition), so the ground truth is known.
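As an illustration, here is a minimal sketch of how such a supervised pair can be built by additive mixing. The function name, SNR parametrization and tensor layout are illustrative assumptions, not the project's actual data chain.

```python
import torch

def make_training_pair(clean: torch.Tensor, noise: torch.Tensor, snr_db: float):
    """Mix a clean source with noise at a target SNR; the clean signal is the ground truth."""
    # Scale the noise so that the mixture reaches the requested signal-to-noise ratio.
    clean_power = clean.pow(2).mean()
    noise_power = noise.pow(2).mean().clamp_min(1e-12)
    gain = torch.sqrt(clean_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    mix = clean + gain * noise  # synthetic mixture fed to the network
    return mix, clean           # (input, ground truth) supervised pair
```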

Training and evaluation framework

Training
Training is performed in a supervised fashion.
  • Training scripts can run locally or on Kaggle Notebooks directly from the command line (fully reproducible; the dataset is stored on Kaggle to ease storage).
  • Experiments are fully versioned under git (an experiment ID tracks the model architecture as well as the training configuration - optimizer and dataloader settings).
  • Experiment results are tracked with Weights and Biases.
We validate the training mechanism in two ways.
  • pytest unit tests to check the behavior of the training loop and data chain.
  • Training a single-scale convolutional model (FlatConv) with the L² loss (MSE) and the Adam optimizer (a minimal sketch of such a loop follows this list).
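The sketch below shows what this baseline training loop looks like, assuming pairs produced by additive mixing. The FlatConv layer layout, the train() helper and its hyper-parameters are illustrative assumptions rather than the project's exact code.

```python
import torch
from torch import nn

# Hypothetical single-scale convolutional baseline; the real FlatConv may differ.
class FlatConv(nn.Module):
    def __init__(self, channels: int = 16, kernel_size: int = 9):
        super().__init__()
        pad = kernel_size // 2
        self.net = nn.Sequential(
            nn.Conv1d(1, channels, kernel_size, padding=pad), nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size, padding=pad), nn.ReLU(),
            nn.Conv1d(channels, 1, kernel_size, padding=pad),
        )

    def forward(self, mix):  # mix: [batch, 1, time]
        return self.net(mix)

def train(model, dataloader, epochs: int = 10, lr: float = 1e-3, device: str = "cuda"):
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.MSELoss()  # L² loss between the estimate and the ground-truth source
    for _ in range(epochs):
        for mix, target in dataloader:  # supervised pairs (mixture, clean source)
            mix, target = mix.to(device), target.to(device)
            optimizer.zero_grad()
            loss = criterion(model(mix), target)
            loss.backward()
            optimizer.step()
```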
Validation
To evaluate our results, we have implemented a GUI (graphical user interface) based on the interactive pipe library, which lets us navigate the test signals, visualize waveforms and play the audio. The GUI runs inference live on the local GPU and can even compare several models.

Exploring architectures

U-Net architectures

Two main architectures working in the time domain: ResUNet & WaveUNet.

Details on UNet

ResUNet
Res-U-Net: Unpaired Deep Cross-Modality Synthesis with Fast Training
  • Coded from scratch (by looking at the architecture figure from the paper). Inspired by one of the most famous architectures in computer vision and image restoration, simply adapted to audio.
  • In image processing, one would usually downsample by a factor of 2, therefore reducing the number of pixels by 4 and increasing the number of channels by 2 (compressing the information in the encoder by a factor of 4/2 = 2 at each stage).
  • For audio, we downsample by a factor of 2 and propose to grow the number of channels by a factor of 1.5 in the encoder phase.
  • Uses residual convolution blocks as the building block instead of plain convolutions (a minimal sketch of such an encoder stage is given below).
Res UNet (Unpaired Deep Cross-Modality Synthesis with Fast Training)
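Here is a minimal sketch of the ideas above (residual 1D convolution blocks, downsampling by 2, channel growth by 1.5×). The class names, kernel sizes and the strided-convolution downsampling are illustrative assumptions; the actual ResUNet block layout follows the paper's figure.

```python
import torch
from torch import nn

class ResidualBlock1d(nn.Module):
    """Two 1D convolutions with a skip connection (illustrative layout)."""
    def __init__(self, channels: int, kernel_size: int = 5):
        super().__init__()
        pad = kernel_size // 2
        self.conv1 = nn.Conv1d(channels, channels, kernel_size, padding=pad)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size, padding=pad)
        self.act = nn.ReLU()

    def forward(self, x):
        y = self.act(self.conv1(x))
        y = self.conv2(y)
        return self.act(x + y)  # residual connection

class EncoderStage(nn.Module):
    """Downsample time by 2 and grow channels by 1.5x, as described above."""
    def __init__(self, in_channels: int):
        super().__init__()
        out_channels = int(in_channels * 1.5)
        self.down = nn.Conv1d(in_channels, out_channels, kernel_size=4, stride=2, padding=1)
        self.res = ResidualBlock1d(out_channels)

    def forward(self, x):  # x: [batch, in_channels, time]
        return self.res(self.down(x))  # -> [batch, 1.5 * in_channels, time // 2]
```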

Wave UNet
Wave-UNet: Dedicated to audio separation




Wave UNet dedicated to audio
Roughly speaking, when the number of parameters goes above 1M, we start overfitting.
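To check where a candidate model sits relative to that ~1M threshold, a simple parameter count can be used (the helper name is an illustrative assumption):

```python
def count_parameters(model) -> int:
    """Count trainable parameters, to compare against the ~1M overfitting threshold observed above."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
```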

Audio example from the test run set: wandb-artifact:///teammd/audio-separation/test_noise_0:26a15665053cae651f50/mix_snr_-4.wav