
Audio source separation


Project details




Context

Audio source separation aims to recover several signals from a mixed signal.
Deep neural networks can learn to perform this task from training data in a supervised way: sources are synthetically combined (by addition), so the ground truth is known.
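As an illustration, here is a minimal sketch of how such a supervised pair can be built by additive mixing. The function name, SNR parametrization and tensor layout are illustrative assumptions, not the project's actual data chain.

```python
import torch

def make_training_pair(clean: torch.Tensor, noise: torch.Tensor, snr_db: float):
    """Mix a clean source with noise at a target SNR; the clean signal is the ground truth."""
    # Scale the noise so that the mixture reaches the requested signal-to-noise ratio.
    clean_power = clean.pow(2).mean()
    noise_power = noise.pow(2).mean().clamp_min(1e-12)
    gain = torch.sqrt(clean_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    mix = clean + gain * noise  # synthetic mixture fed to the network
    return mix, clean           # (input, ground truth) supervised pair
```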

Training and evaluation framework

Training
Training is performed in a supervised fashion.
  • Training scripts can run locally or on Kaggle Notebooks directly from the command line (fully reproducible; the dataset is stored on Kaggle to ease storage).
  • Experiments are fully versioned under git (an experiment ID tracks the model architecture as well as the training configuration - optimizer and dataloader settings).
  • Experiment results are tracked with Weights and Biases.
We validate the training mechanism in two ways.
  • pytest unit tests to check the behavior of the training loop and data chain.
  • Training a single-scale convolutional model (FlatConv) with the L² loss (MSE) and the Adam optimizer (a minimal sketch of such a loop follows this list).
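The sketch below shows what this baseline training loop looks like, assuming pairs produced by additive mixing. The FlatConv layer layout, the train() helper and its hyper-parameters are illustrative assumptions rather than the project's exact code.

```python
import torch
from torch import nn

# Hypothetical single-scale convolutional baseline; the real FlatConv may differ.
class FlatConv(nn.Module):
    def __init__(self, channels: int = 16, kernel_size: int = 9):
        super().__init__()
        pad = kernel_size // 2
        self.net = nn.Sequential(
            nn.Conv1d(1, channels, kernel_size, padding=pad), nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size, padding=pad), nn.ReLU(),
            nn.Conv1d(channels, 1, kernel_size, padding=pad),
        )

    def forward(self, mix):  # mix: [batch, 1, time]
        return self.net(mix)

def train(model, dataloader, epochs: int = 10, lr: float = 1e-3, device: str = "cuda"):
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.MSELoss()  # L² loss between the estimate and the ground-truth source
    for _ in range(epochs):
        for mix, target in dataloader:  # supervised pairs (mixture, clean source)
            mix, target = mix.to(device), target.to(device)
            optimizer.zero_grad()
            loss = criterion(model(mix), target)
            loss.backward()
            optimizer.step()
```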
Validation
To evaluate our results, we have implemented a GUI (graphical user interface) based on the interactive pipe library, which lets us navigate the test signals, visualize waveforms and play the audio. The GUI runs inference live on the local GPU and can even compare several models.

Exploring architectures

U-Net architectures

Two main architectures working in the time domain: ResUNet & WaveUNet.

Details on UNet

ResUNet
Res-U-Net: Unpaired Deep Cross-Modality Synthesis with Fast Training
  • Coded from scratch (by looking at the architecture figure from the paper). Inspired by one of the most famous architectures in computer vision and image restoration, simply adapted to audio.
  • In image processing, one would usually downsample by a factor of 2, therefore reducing the number of pixels by 4 and increasing the number of channels by 2 (compressing the information in the encoder by a factor of 4/2 = 2 at each stage).
  • For audio, we downsample by a factor of 2 and propose to grow the number of channels by a factor of 1.5 in the encoder phase.
  • Uses residual convolution blocks as the building block instead of plain convolutions (a minimal sketch of such an encoder stage is given below).
Res UNet (Unpaired Deep Cross-Modality Synthesis with Fast Training)
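Here is a minimal sketch of the ideas above (residual 1D convolution blocks, downsampling by 2, channel growth by 1.5×). The class names, kernel sizes and the strided-convolution downsampling are illustrative assumptions; the actual ResUNet block layout follows the paper's figure.

```python
import torch
from torch import nn

class ResidualBlock1d(nn.Module):
    """Two 1D convolutions with a skip connection (illustrative layout)."""
    def __init__(self, channels: int, kernel_size: int = 5):
        super().__init__()
        pad = kernel_size // 2
        self.conv1 = nn.Conv1d(channels, channels, kernel_size, padding=pad)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size, padding=pad)
        self.act = nn.ReLU()

    def forward(self, x):
        y = self.act(self.conv1(x))
        y = self.conv2(y)
        return self.act(x + y)  # residual connection

class EncoderStage(nn.Module):
    """Downsample time by 2 and grow channels by 1.5x, as described above."""
    def __init__(self, in_channels: int):
        super().__init__()
        out_channels = int(in_channels * 1.5)
        self.down = nn.Conv1d(in_channels, out_channels, kernel_size=4, stride=2, padding=1)
        self.res = ResidualBlock1d(out_channels)

    def forward(self, x):  # x: [batch, in_channels, time]
        return self.res(self.down(x))  # -> [batch, 1.5 * in_channels, time // 2]
```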

Wave UNet
Wave-UNet: Dedicated to audio separation




Wave UNet dedicated to audio
Roughly speaking, when the number of parameters goes above 1M, we start overfitting.
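To check where a candidate model sits relative to that ~1M threshold, a simple parameter count can be used (the helper name is an illustrative assumption):

```python
def count_parameters(model) -> int:
    """Count trainable parameters, to compare against the ~1M overfitting threshold observed above."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
```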

Audio example from the test run set: wandb-artifact:///teammd/audio-separation/test_noise_0:26a15665053cae651f50/mix_snr_-4.wav