Speaker Verification and GAN Face Detection

Deep learning generalization and gradient visualization techniques applied to audio speaker verification and GAN imagery detection.
Max Lisaius
This is an adaptation of a final report submitted by Max Lisaius and Bo Sullivan, with tasks selected by Brian Hutchinson for CSCI581 (Deep Learning) Winter 2021 at Western Washington University. Code was built with PyTorch Lightning, using Weights & Biases for logging, hyperparameter sweeps, and headless monitoring. To ensure a level playing field for all students, compute was restricted to 1x RTX 2080 and 4x GTX 1650s, the project timeline was three weeks, and no pre-trained models were allowed.

Abstract

For this project we adapted and trained convolutional neural networks for audio speaker verification and detection of GAN-generated faces. For speaker verification we selected a modified AlexNet trained on slices of audio converted into spectrograms. This led to an accuracy of approximately 94% on the provided validation data, well beyond what random guessing or an inexpressive model could achieve. For GAN imagery detection we used a vanilla ResNet50 and reached an accuracy of 99.9%, far exceeding our expectations. This led us to investigate why the neural net was achieving such high accuracy out of the box. By examining misclassified data and saliency visualizations of the model, we were able to form hypotheses about what the model uses to make these predictions. In both tasks, we found that with the right data augmentation, regularization, and model selection, convolutional neural networks provided accurate predictions for the tasks at hand.

Speaker Verification

Methods

Data Preprocessing / Feature Extraction

We started by creating text files of filename pairs, where approximately half of the audio file pairs came from the same speaker and the other half from two different speakers. We then passed these file pairs into our data loader, so that each pair is one index. When a data point is requested, the audio files are loaded and clipped down to a length set in the hyperparameters (default 25,000 samples). At first we always took this clip from the center of the full audio, but after observing a large overfitting problem on the training set, we switched to taking the clip from a random position in the file. We then converted the pair of 1-D waveforms into audio spectrograms with torchaudio Spectrogram() transforms, and stacked the resulting 2-D spectrograms depth-wise so that each clip in the pair occupied its own channel.
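A minimal sketch of this loading pipeline is shown below; the class name, defaults, and padding behavior are illustrative assumptions rather than the exact project code.

```python
import random

import torch
import torch.nn.functional as F
import torchaudio
from torch.utils.data import Dataset


class SpeakerPairDataset(Dataset):
    """Hypothetical sketch: loads a pair of audio files, random-crops each to a
    fixed number of samples, and stacks their spectrograms channel-wise."""

    def __init__(self, pairs, clip_len=25000):
        # pairs: list of (path_a, path_b, same_speaker) tuples
        self.pairs = pairs
        self.clip_len = clip_len
        self.spec = torchaudio.transforms.Spectrogram()

    def _random_clip(self, path):
        waveform, _ = torchaudio.load(path)   # (channels, num_samples)
        waveform = waveform.mean(dim=0)       # mono, (num_samples,)
        if waveform.numel() > self.clip_len:
            # Random crop instead of a fixed center crop, to reduce overfitting.
            start = random.randint(0, waveform.numel() - self.clip_len)
            waveform = waveform[start:start + self.clip_len]
        else:
            waveform = F.pad(waveform, (0, self.clip_len - waveform.numel()))
        return waveform

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        path_a, path_b, same_speaker = self.pairs[idx]
        spec_a = self.spec(self._random_clip(path_a))  # (freq, time)
        spec_b = self.spec(self._random_clip(path_b))
        x = torch.stack([spec_a, spec_b], dim=0)       # (2, freq, time)
        return x, torch.tensor(same_speaker, dtype=torch.long)
```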

Models Developed

The final model we went with was a modified AlexNet. We found that when no pre-trained models are allowed, AlexNet served as a good network for learning from our two-channel spectrogram inputs while training quickly within the limited time. Adding random shifting of the audio clips in the data loader greatly improved the model's ability to generalize. We also looked at models such as a Siamese network with a contrastive loss function, ResNet, and VGG, but these networks did not perform as well as AlexNet, especially without pre-training.
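As a rough illustration (the exact modifications are not reproduced in the report), adapting torchvision's AlexNet to the two-channel spectrogram input and a binary output could look like this:

```python
import torch.nn as nn
from torchvision.models import alexnet


def build_modified_alexnet(num_classes=2):
    # No pre-trained weights were allowed, so start from a random initialization.
    model = alexnet(weights=None)
    # Replace the first conv layer to accept the 2-channel stacked spectrograms
    # instead of 3-channel RGB images.
    model.features[0] = nn.Conv2d(2, 64, kernel_size=11, stride=4, padding=2)
    # Replace the final classifier layer for the binary same/different decision.
    model.classifier[6] = nn.Linear(4096, num_classes)
    return model
```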

Training and Tuning

Our model yielded modest results on our first successful run: the modified AlexNet reached 72% validation accuracy with a test loss of 0.261. We decided to enlarge the training set by generating more pairings and to use hard example mining, changing the 50/50 split in the training set to a 62.26/37.26 split of unalike to alike pairings. This improved accuracy significantly, along with some minor hand-tuning of other hyperparameters. For hyperparameters such as audio clip size, minibatch size, and learning rate, we used Weights & Biases Sweeps to accelerate tuning and distribute runs across multiple compute nodes.
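A sweep configuration along these lines can be launched programmatically; the parameter names, ranges, and train() entry point below are illustrative assumptions, not the project's actual sweep file.

```python
import wandb

sweep_config = {
    "method": "random",  # could also be "bayes" or "grid"
    "metric": {"name": "val_acc", "goal": "maximize"},
    "parameters": {
        "learning_rate": {"distribution": "log_uniform_values", "min": 1e-5, "max": 1e-3},
        "batch_size": {"values": [16, 20, 32]},
        "clip_len": {"values": [20000, 25000, 30000]},
    },
}


def train():
    # Placeholder entry point: wandb.init() receives the sweep's hyperparameters
    # via run.config; the real function would build and fit the model here.
    run = wandb.init()
    config = run.config
    ...


sweep_id = wandb.sweep(sweep_config, project="speaker-verification")
# Each compute node runs an agent pulling configurations from the same sweep.
wandb.agent(sweep_id, function=train, count=20)
```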

Baselines

For our testing, we kept the dev data split at 50/50. Our initial under-performing models, trained with 50/50 training splits and a smaller training set, produced poor metrics. We also tried training directly on waveform features, but we were unable to create a model expressive enough to learn from this form of data.

Results

Our best model was the modified AlexNet trained with the Adam optimizer, a learning rate of 0.000044, a minibatch size of 20, and a clip size of 30,000 samples, reaching 94.58% validation accuracy.

GAN Face Detection

Methods

Data Preprocessing / Feature Extraction

For this task, we loaded each picture with PIL, converted it to a tensor, and returned it with a label.
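A minimal sketch of that loader, assuming a simple real/fake directory layout and PNG files (both assumptions, not stated in the report), could be:

```python
from pathlib import Path

import torch
from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms


class FaceDataset(Dataset):
    """Hypothetical sketch: each item is (image_tensor, label), with label 1 = fake."""

    def __init__(self, real_dir, fake_dir):
        real = [(p, 0) for p in Path(real_dir).glob("*.png")]
        fake = [(p, 1) for p in Path(fake_dir).glob("*.png")]
        self.items = real + fake
        self.to_tensor = transforms.ToTensor()

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        path, label = self.items[idx]
        image = Image.open(path).convert("RGB")
        return self.to_tensor(image), torch.tensor(label)
```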

Models Developed

We tried multiple models, including AlexNet and VGG, but ResNet50 worked best. Even though each epoch took almost an hour, ResNet50 achieved accuracies above 90% after the first epoch, with improvement tapering off around 90-100 epochs.

Training and Tuning

ResNet50 with a learning rate of 0.00015 and the Adam optimizer yielded accuracy of over 99% within the first 10 epochs.
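Since the project was built with PyTorch Lightning, the training setup likely resembled the following sketch; the module structure and logging calls are assumptions, with the optimizer and learning rate taken from above.

```python
import pytorch_lightning as pl
import torch
import torch.nn.functional as F
from torchvision.models import resnet50


class GanFaceClassifier(pl.LightningModule):
    def __init__(self, lr=1.5e-4):
        super().__init__()
        # No pre-trained weights allowed: random initialization, 2 output classes.
        self.model = resnet50(weights=None, num_classes=2)
        self.lr = lr

    def forward(self, x):
        return self.model(x)

    def training_step(self, batch, batch_idx):
        images, labels = batch
        logits = self(images)
        loss = F.cross_entropy(logits, labels)
        acc = (logits.argmax(dim=1) == labels).float().mean()
        self.log_dict({"train_loss": loss, "train_acc": acc})
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.lr)
```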

Baselines

In our initial experiments with inexpressive models, we observed noisy accuracy oscillating between 45% and 55%. Originally we thought that this initial uptick to 55% was a sign of training occurring; however, it turned out to be random guessing over an imbalanced label distribution. Looking at the dev data provided, we noted that it consists of 6923 real and 8401 fake images, a split of roughly 45.2% real and 54.8% fake.
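The majority-class baseline follows directly from those counts:

```python
real, fake = 6923, 8401
total = real + fake
# Predicting "fake" for everything matches the larger class proportion.
print(f"real: {real / total:.1%}, fake: {fake / total:.1%}")  # real: 45.2%, fake: 54.8%
```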

Results

In order to visualize why the model performs so well, we produced visualizations of the saliency, or input gradients, of the neural network. While we cannot be certain, we can make some reasonable hypotheses using these maps.
The fake maps' activations are mostly evenly distributed on the faces and around the eyes and mouth, while the real maps' activations are more uneven and dispersed. From this, we gather that the model is picking up on something about the facial features or their symmetry. The network may be recognizing artifacts or in-painting imposed by the generator that created these images, suggesting that our model is more capable than the discriminator of the GAN that produced them. Another possible explanation for the abnormally high accuracies is the data itself; the data collection process should be reviewed for diversity of GAN imagery, face-centering, and head pose, among other factors.
Fake and Real Classified Faces, with Saliency Maps
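These maps were generated from input gradients; a minimal sketch of vanilla saliency (the exact visualization procedure used in the project may have differed slightly) is:

```python
import torch


def saliency_map(model, image, target_class):
    """Gradient of the target class score with respect to the input pixels."""
    model.eval()
    x = image.unsqueeze(0).requires_grad_(True)  # (1, 3, H, W)
    score = model(x)[0, target_class]
    score.backward()
    # Max absolute gradient across color channels -> (H, W) saliency map.
    return x.grad.abs().max(dim=1).values.squeeze(0)
```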
We also wanted to investigate the pictures that the model classified incorrectly, as these might provide insight into how the model makes its predictions.
We noticed that a majority of the incorrect guesses are real images labeled as fake, and that the subjects are primarily women. This suggests that the data may contain bias in its representation of women, or that the misclassified images may have had some post-processing applied to them. Another possibility is that these images were taken by professionals with manual-focus lenses, creating a blurry background similar to those produced in GAN-generated faces.
These factors may have contributed to the misclassification of these images, as such post-processing and photography work may accidentally mimic some of the artifacts that GANs generate. If that is not the case, however, we are left asking whether some people's faces simply look more generated to these models than others. If so, there is the ethical consideration of whether these people may face disproportionate hardship in creating social media accounts or passing general digital identity verification in the future.
Misclassified images with their target, prediction, and saliency maps