Natural Language Processing
0. Recap of Neural Networks
Concepts
Perceptron The Perceptron takes in some input values and generates an output value by calculating the weighted sum of the input values and then applying an activation function (a nonlinear transformation) to that sum.
Multilayer Perceptron (MLP) A Multilayer Perceptron is a feedforward neural network with multiple fully-connected layers that use nonlinear activation functions. An MLP with more than two layers is the most basic form of a deep neural net.
Backpropagation Backpropagation is the algorithm that lets a neural network learn: it calculates the gradients of the loss with respect to the weights of a feedforward computational graph, propagating gradients backwards through the network using the chain rule.
Building Blocks of Neural Networks
Activation Function Activation functions (sigmoid, tanh, ReLU) are non-linear functions that are applied to the weighted sum of the inputs of a neuron. They enable neural nets to go beyond linear functions and approximate any complex function.
Batch Size The total number of training examples in a single batch. Large batch sizes can harness the power of GPUs to process more training instances per unit of time; small batch sizes tend to generalize better and are less memory intensive.
Categorical Cross-Entropy Loss The categorical cross-entropy loss, aka the negative log likelihood, is a popular loss function for classification tasks. It measures the difference between two probability distributions – the true and predicted label distributions.
Dropout Dropout is a remarkably simple regularization technique that still gives a meaningful performance boost (~2% for state-of-the-art models). All dropout does is randomly turn off a percentage of neurons at each layer, at each training step. This makes the network more robust because it can’t rely on any particular set of input neurons for making predictions; the knowledge is distributed across the whole network. Around 2^n slightly different sub-networks (where n is the number of neurons in the architecture) are sampled during training and effectively ensembled together to make predictions.
Early Stopping Early Stopping lets you train a model with more hidden layers, more hidden neurons, and for more epochs than you need – and then simply stops training once performance has not improved for n consecutive epochs. It saves the best performing model for you and prevents overfitting.
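In Keras, early stopping is a single callback; a minimal sketch (the patience value is illustrative):
# early stopping callback (illustrative sketch)
import tensorflow as tf
# stop once val_loss has not improved for 5 consecutive epochs,
# and roll back to the best weights seen during training
early_stop = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=5,
                                              restore_best_weights=True)
# pass it to model.fit(..., callbacks=[early_stop])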
Epoch The number of epochs is how many times your network sees the full dataset. One epoch means the entire dataset is passed forward and backward through the neural network exactly once.
Gradient Descent Gradient descent is an iterative optimization algorithm for differentiable functions that finds a (local) minimum of the function. We use it to find the lowest point of our loss function.
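As a toy illustration (not from the original notes), applying the update rule w ← w − lr · dL/dw to the quadratic loss L(w) = (w − 3)²:
# toy gradient descent on L(w) = (w - 3)^2
def grad(w):
    return 2 * (w - 3)      # dL/dw
w, lr = 0.0, 0.1            # initial weight and learning rate
for step in range(100):
    w -= lr * grad(w)       # step against the gradient
print(round(w, 4))          # approaches 3, the minimizer of the loss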
Hidden Layer A hidden layer in a neural network is any layer between the input and output layers, where neurons take in a set of weighted inputs and produce an output through an activation function.
Hyperparameter Tuning Hyperparameter tuning is the process of searching the hyperparameter space for the best hyperparameter values, via grid, random, or Bayesian search.
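With W&B this can be done via sweeps; a minimal sketch, where the hyperparameter names and the train function are placeholders:
# hyperparameter sweep (illustrative sketch)
import wandb
def train():
    # placeholder training function: reads hyperparameters from wandb.config
    run = wandb.init()
    print(run.config.learning_rate, run.config.hidden_dims)
sweep_config = {
    'method': 'random',      # also supports 'grid' and 'bayes'
    'metric': {'name': 'val_accuracy', 'goal': 'maximize'},
    'parameters': {
        'learning_rate': {'values': [1e-2, 1e-3, 1e-4]},
        'hidden_dims': {'values': [64, 128, 256]},
    },
}
sweep_id = wandb.sweep(sweep_config, project="bloomberg-class")
wandb.agent(sweep_id, function=train)   # runs `train` once per sampled configuration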
Learning Rate The learning rate is a scalar used to update model parameters during gradient descent: it is the factor by which we multiply the gradients, and so it determines how much the weights change at each training step.
Momentum Gradient descent takes tiny, consistent steps towards a local minimum, and when the gradients are tiny it can take a long time to converge. Momentum takes the previous gradients into account and accelerates small but consistent gradients, speeding up convergence by pushing through shallow valleys faster and helping to avoid getting stuck in poor local minima.
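Continuing the toy example above, a sketch of the classical momentum update (v ← βv + ∇L(w), then w ← w − lr·v):
# toy momentum update on the same loss L(w) = (w - 3)^2
def grad(w):
    return 2 * (w - 3)
w, v, lr, beta = 0.0, 0.0, 0.1, 0.9
for step in range(500):
    v = beta * v + grad(w)  # accumulate an exponentially decaying sum of gradients
    w -= lr * v             # update the weights using the accumulated velocity
print(round(w, 4))          # approaches 3, the minimizer of the loss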
One Hot Encoding Many machine learning algorithms cannot operate on label data directly; they need all input and output variables to be numeric. In general this is a constraint of efficient implementations rather than a hard limitation of the algorithms themselves, but it means categorical data must be converted to a numerical form. One hot encoding does this by representing each category as a binary vector with a 1 in the position of that category and 0s everywhere else.
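For example, with the Keras utility used in the script below (the label values here are made up):
# one hot encoding example
import tensorflow as tf
labels = [0, 2, 1, 2]                                      # integer class labels
print(tf.keras.utils.to_categorical(labels, num_classes=3))
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]
#  [0. 0. 1.]]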
Basic Neural Network
# keras-perceptron/perceptron-normalize.py
import wandb
import tensorflow as tf
# logging code
run = wandb.init(entity="wandb", project="bloomberg-class")
config = run.config
config.concept = 'mlp'
# load data
(X_train, y_train), (X_test, y_test) = tf.keras.datasets.mnist.load_data()
img_width = X_train.shape[1]
img_height = X_train.shape[2]
# normalize data
X_train = X_train.astype('float32') / 255.
X_test = X_test.astype('float32') / 255.
# one hot encode outputs
y_train = tf.keras.utils.to_categorical(y_train)
y_test = tf.keras.utils.to_categorical(y_test)
labels = [str(i) for i in range(10)]
num_classes = y_train.shape[1]
# create model
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Flatten(input_shape=(img_width, img_height)))
model.add(tf.keras.layers.Dense(128, activation='relu'))
model.add(tf.keras.layers.Dropout(0.4))
model.add(tf.keras.layers.Dense(64, activation='relu'))
model.add(tf.keras.layers.Dropout(0.4))
model.add(tf.keras.layers.Dense(num_classes, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam',
metrics=['accuracy'])
# Fit the model
model.fit(X_train, y_train, epochs=10, validation_data=(X_test, y_test),
callbacks=[wandb.keras.WandbCallback(data_type="image", labels=labels, save_model=False)])
1. Recurrent Neural Networks
Deep Dive
Videos:
- Recurrent Neural Networks (RNN) and Long Short-Term Memory
- Understanding RNN and LSTM by Brandon Rohrer
Concepts
Create Character Encodings
Bag of Words
Bag of words is a method of feature engineering for text. The vocabulary is built from the corpus, and in the resulting BoW feature vector each dimension indicates whether (or how often) the corresponding token appears in the document. BoW ignores word order.
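A minimal sketch with scikit-learn's CountVectorizer (the two example sentences are made up):
# bag of words example
from sklearn.feature_extraction.text import CountVectorizer
docs = ["the cat sat on the mat", "the dog sat"]
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(docs)   # one row per document, one column per vocabulary token
print(vectorizer.vocabulary_)          # token -> column index, built from the whole corpus
print(bow.toarray())                   # token counts per document; word order is lost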
Embedding
Embeddings map inputs like words or sentences to vectors of numbers.
GloVe
GloVe is an unsupervised algorithm to convert words into embeddings trained on co-occurrence statistics.
word2vec
word2vec is an algorithm that also learns word embeddings, by trying to predict the context of words in a document. word2vec vectors have useful mathematical properties and can be added and subtracted, e.g. vector('king') - vector('man') + vector('woman') ≈ vector('queen')
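A hedged sketch with gensim's downloader (the model name is one of gensim's pre-trained options; the download is large):
# word2vec analogy example
import gensim.downloader as api
# downloads the pre-trained word2vec vectors (~1.6 GB) on first use
vectors = api.load("word2vec-google-news-300")
# vector arithmetic: king - man + woman ≈ queen
print(vectors.most_similar(positive=['king', 'woman'], negative=['man'], topn=1))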
# Inspect data
from tensorflow.keras.datasets import imdb
vocab_size = 10000
embedding_dims = 50
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=vocab_size)
# map the integer ids back to words so a review is readable (ids 0-2 are reserved)
id_to_word = {i + 3: w for w, i in imdb.get_word_index().items()}
id_to_word.update({0: '<pad>', 1: '<start>', 2: '<unk>'})
decode = lambda review: ' '.join(id_to_word.get(i, '<unk>') for i in review)
print(X_train[0])          # the review as a sequence of integer ids
print(decode(X_train[0]))  # the same review decoded back to words
See example in Jupyter notebook
Recurrent Neural Networks
RNNs are a class of neural networks used to model sequential data by maintaining a hidden state. An RNN uses the same parameters and performs the same calculations at each step, with different inputs. At each time step it calculates a new hidden state (“memory”) based on the current input and the previous hidden state, persisting information through an internal loop in the network.
RNNs can capture and learn the order of the inputs they receive, which makes them well suited to sequential data such as natural language. They can, however, suffer from vanishing and exploding gradient problems.
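The recurrence itself is only a couple of lines; a numpy sketch of a vanilla RNN step, h_t = tanh(W_x·x_t + W_h·h_{t-1} + b), with made-up dimensions and random inputs:
# vanilla RNN recurrence (illustrative sketch)
import numpy as np
input_dim, hidden_dim, seq_len = 8, 16, 5
rng = np.random.RandomState(0)
W_x = rng.randn(hidden_dim, input_dim) * 0.1   # input-to-hidden weights (shared across steps)
W_h = rng.randn(hidden_dim, hidden_dim) * 0.1  # hidden-to-hidden weights (shared across steps)
b = np.zeros(hidden_dim)
h = np.zeros(hidden_dim)                       # initial hidden state ("memory")
for t in range(seq_len):
    x_t = rng.randn(input_dim)                 # stand-in for the input at step t
    h = np.tanh(W_x @ x_t + W_h @ h + b)       # new state from current input + previous state
print(h.shape)                                 # (16,)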
Long Short-Term Memory Unit (LSTM)
LSTM units in RNNs help combat the vanishing gradient problem by using neurons with a memory cell and three gates:
- input – determines how much of the information from the previous layer gets stored in the cell
- output – determines how much of the current cell state the next layer gets to see
- forget – determines what to forget about the current state of the memory cell
# examples/lstm/imdb-classifier/imdb-embedding.py
import wandb
import tensorflow as tf
import numpy as np
import subprocess
from tensorflow.keras.preprocessing import text, sequence
from tensorflow.python.client import device_lib
from tensorflow.keras.layers import LSTM, GRU, CuDNNLSTM, CuDNNGRU
from tensorflow.keras.datasets import imdb
import os
# set parameters:
wandb.init(entity="wandb", project="bloomberg-class")
config = wandb.config
config.concept = 'lstm-glove'
config.vocab_size = 1000
config.maxlen = 300
config.batch_size = 64
config.embedding_dims = 50
config.filters = 250
config.kernel_size = 3
config.hidden_dims = 100
config.epochs = 10
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=config.vocab_size)
X_train = sequence.pad_sequences(X_train, maxlen=config.maxlen)
X_test = sequence.pad_sequences(X_test, maxlen=config.maxlen)
# build an index mapping words to their GloVe vectors
embeddings_index = dict()
f = open('glove.6B.50d.txt')
for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()
# map imdb's integer ids back to words (ids 0-2 are reserved for pad/start/unk)
id_to_word = {index + 3: word for word, index in imdb.get_word_index().items()}
id_to_word.update({0: '<pad>', 1: '<start>', 2: '<unk>'})
# build the embedding matrix; words without a GloVe vector stay all-zero
embedding_matrix = np.zeros((config.vocab_size, config.embedding_dims))
for index in range(config.vocab_size):
    word = id_to_word.get(index)
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[index] = embedding_vector
# override LSTM & GRU with the CuDNN implementations when a GPU is available
if 'GPU' in str(device_lib.list_local_devices()):
    print("Using CUDA for RNN layers")
    LSTM = CuDNNLSTM
    GRU = CuDNNGRU
# create model
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Embedding(config.vocab_size, config.embedding_dims, input_length=config.maxlen,
weights=[embedding_matrix], trainable=True))
model.add(LSTM(100, return_sequences=True))
model.add(LSTM(config.hidden_dims))
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy',
optimizer='rmsprop',
metrics=['accuracy'])
# TextLogger is a custom logging callback (not shown here)
model.fit(X_train, y_train,
          batch_size=config.batch_size,
          epochs=config.epochs,
          validation_data=(X_test, y_test),
          callbacks=[TextLogger(X_test[:20], y_test[:20]),
                     wandb.keras.WandbCallback(save_model=False)])
Bidirectional LSTMs
Bidirectional RNNs are composed of two RNNs stacked on top of each other, flowing in opposite directions. The forward RNN reads the input sequence from start to end, while the backward RNN reads it from end to start. We combine their states by concatenating their vectors, which lets us make predictions using context from both before and after each word.
# examples/lstm/imdb-classifier/imdb-lstm.py
import wandb
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import LSTM, GRU
from tensorflow.python.client import device_lib
from tensorflow.keras.preprocessing import text, sequence
from tensorflow.keras.datasets import imdb
# set parameters:
wandb.init(entity="wandb", project="bloomberg-class")
config = wandb.config
config.concept = 'lstm-bidir'
config.vocab_size = 1000
config.maxlen = 300
config.batch_size = 32
config.embedding_dims = 50
config.filters = 250
config.kernel_size = 3
config.hidden_dims = 100
config.epochs = 10
# Load and tokenize input
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=config.vocab_size)
# Ensure all input is the same size
X_train = sequence.pad_sequences(
X_train, maxlen=config.maxlen)
X_test = sequence.pad_sequences(
X_test, maxlen=config.maxlen)
# override LSTM & GRU with the CuDNN implementations when a GPU is available
if 'GPU' in str(device_lib.list_local_devices()):
    print("Using CUDA for RNN layers")
    LSTM = tf.keras.layers.CuDNNLSTM
    GRU = tf.keras.layers.CuDNNGRU
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Embedding(config.vocab_size,
config.embedding_dims,
input_length=config.maxlen))
model.add(tf.keras.layers.Bidirectional(LSTM(config.hidden_dims)))
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy',
optimizer='rmsprop',
metrics=['accuracy'])
# TextLogger is a custom logging callback (not shown here)
model.fit(X_train, y_train,
          batch_size=config.batch_size,
          epochs=config.epochs,
          validation_data=(X_test, y_test),
          callbacks=[TextLogger(X_test[:20], y_test[:20]),
                     wandb.keras.WandbCallback(save_model=False)])
Seq2Seq Translation
seq2seq models are a combination of two RNNs – one serving as the encoder, which compresses the input sequence into a fixed-size representation, and the other as the decoder, which generates the output sequence from that representation. They are widely used for machine translation.
# examples/lstm/seq2seq/train.py
# adapted from https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html
import tensorflow as tf
import numpy as np
import wandb
wandb.init(entity="wandb", project="bloomberg-class")
config = wandb.config
config.concept = 'seq2seq'
class CharacterTable(object):
    """Given a set of characters:
    + Encode them to a one hot integer representation
    + Decode the one hot integer representation to their character output
    + Decode a vector of probabilities to their character output
    """

    def __init__(self, chars):
        """Initialize character table.
        # Arguments
            chars: Characters that can appear in the input.
        """
        self.chars = sorted(set(chars))
        self.char_indices = dict((c, i) for i, c in enumerate(self.chars))
        self.indices_char = dict((i, c) for i, c in enumerate(self.chars))

    def encode(self, C, num_rows):
        """One hot encode given string C.
        # Arguments
            num_rows: Number of rows in the returned one hot encoding. This is
                used to keep the # of rows for each data the same.
        """
        x = np.zeros((num_rows, len(self.chars)))
        for i, c in enumerate(C):
            x[i, self.char_indices[c]] = 1
        return x

    def decode(self, x, calc_argmax=True):
        if calc_argmax:
            x = x.argmax(axis=-1)
        return ''.join(self.indices_char[x] for x in x)
# Parameters for the model and dataset.
config.training_size = 10000
config.digits = 3
config.hidden_size = 64
config.batch_size = 64
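The snippet above stops at the dataset parameters. As a hedged sketch (not part of the original script), an encoder-decoder for this kind of character task could be wired up as follows, continuing the script and assuming a character set of the ten digits, '+' and space:
# hedged sketch of the encoder-decoder model (not from the original script)
num_chars = len('0123456789+ ')            # assumed character set
input_len = config.digits * 2 + 1          # e.g. "535+61"
output_len = config.digits + 1             # e.g. "596"
model = tf.keras.models.Sequential()
# encoder: compress the input sequence into a single hidden vector
model.add(tf.keras.layers.LSTM(config.hidden_size, input_shape=(input_len, num_chars)))
# repeat the encoding once per output timestep, then decode
model.add(tf.keras.layers.RepeatVector(output_len))
model.add(tf.keras.layers.LSTM(config.hidden_size, return_sequences=True))
model.add(tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(num_chars, activation='softmax')))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
The first LSTM plays the role of the encoder, RepeatVector copies its final state once per output character, and the second LSTM plus a per-timestep softmax act as the decoder.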
2. Explaining NLP Predictions
Attention Maps
Attention Mechanism Attention Mechanisms are inspired by human visual attention, the ability to focus on specific parts of an image. Attention mechanisms can be incorporated in both Language Processing and Image Recognition architectures to help the network learn what to “focus” on when making predictions.
# HeatMap
'''
Arguments:
matrix_values (arr): 2D dataset of shape x_labels * y_labels, containing
heatmap values that can be coerced into an ndarray.
x_labels (list): Named labels for rows (x_axis).
y_labels (list): Named labels for columns (y_axis).
show_text (bool): Show text values in heatmap cells.
'''
wandb.log({'heatmap_with_text': wandb.plots.HeatMap(x_labels, y_labels, matrix_values, show_text=False)})
Example - Neural Machine Translation from English → French.
ExplainText
eli5's LIME-based TextExplainer.
tl;dr: helps debug model predictions by showing which features (words/phrases) were important for a given classification prediction.
# ExplainText
'''
Arguments:
text (str): Text to explain
probas (probabilities from a black-box classifier): A function which
takes a list of strings (documents) and returns a matrix
of shape (n_samples, n_classes) with probability values,
i.e. a row per document and a column per output label.
'''
wandb.log({'explain_nlp': wandb.plots.ExplainText(text=doc, probas=pipe.predict_proba, target_names=twenty_train.target_names)})
Example - Classify newsgroups by topic.
Below we train some scikit models to classify 18000 newsgroups posts into 20 topics using the 20 newsgroups dataset.
Behind The Scenes
Implements LIME's algorithm:
- generate distorted versions of the text;
- predict probabilities for these distorted texts using the black-box classifier;
- train another classifier (one of those eli5 supports) which tries to predict the output of the black-box classifier on these texts.
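A hedged sketch of that loop via eli5's TextExplainer, reusing the pipe pipeline, a single document doc, and twenty_train from the newsgroups example above:
# eli5 TextExplainer (illustrative sketch)
from eli5.lime import TextExplainer
te = TextExplainer(random_state=42)
# fit() generates perturbed versions of `doc`, scores them with pipe.predict_proba,
# and trains a local white-box classifier on those predictions
te.fit(doc, pipe.predict_proba)
te.show_prediction(target_names=twenty_train.target_names)  # highlighted words per class
print(te.metrics_)  # how faithfully the local model mimics the black-box classifier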
Named Entity Recognition
spaCy's entity visualizer, which highlights named entities and their labels in a text.
# NER
'''
Arguments:
docs (list, Doc, Span): Document(s) to visualize.
'''
wandb.log({'NER': wandb.plots.NER(docs=doc)})
Part of Speech Tagging
Adds support for spaCy's dependency visualizer which shows part-of-speech tags and syntactic dependencies based on both definition and context.
# POS
'''
Arguments:
docs (list, Doc, Span): Document(s) to visualize.
'''
wandb.log({'part_of_speech': wandb.plots.POS(docs=doc)})
Confusion Matrices and ROC curves in wandb.log()
# ROC
wandb.log({'roc': wandb.plots.ROC(y_test, y_probas, nb.classes_)})
# Precision Recall
wandb.log({'pr': wandb.plots.precision_recall(y_test, y_probas, nb.classes_)})
# Confusion Matrices
wandb.sklearn.plot_confusion_matrix(y_test, y_pred, nb.classes_)
3. Let's run through an example
The Goal - Find Unintended Bias in Toxic Tweets
The Dataset
Toxicity Subtypes Distribution
Toxic Subtypes and Identity Correlation
Lexical Analysis
Toxicity by Identity Tags (Frequency)
Weighted Analysis of Most Frequently Toxic Tags
Correlation between identities - which identities are mentioned together?
Time Series Analysis of Toxicity
Word Clouds
All Identities
Emoji Usage in Toxic Comments
Word Embeddings
Word embedding models take a text corpus as input and output a vector representation for each word. We use t-SNE to draw a scatter plot of similar words in the embedding space.
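A minimal sketch, assuming a word → vector dictionary like the embeddings_index built from the GloVe file earlier:
# t-SNE projection of word embeddings (illustrative sketch)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
words = list(embeddings_index.keys())[:300]            # a small sample of the vocabulary
vectors = np.array([embeddings_index[w] for w in words])
coords = TSNE(n_components=2, random_state=0).fit_transform(vectors)
plt.figure(figsize=(12, 12))
plt.scatter(coords[:, 0], coords[:, 1], s=5)
for (x, y), word in zip(coords, words):
    plt.annotate(word, (x, y), fontsize=7)             # nearby points are similar words
plt.title("t-SNE projection of word embeddings")
plt.show()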
Effect of Embeddings on Bias
Bias Benchmarks
- Subgroup AUC: The AUC score for the entire subgroup – a low score here means the model fails to distinguish between toxic and non-toxic comments that mention this identity.
- BPSN AUC: Background positive, subgroup negative. A low value here means the model confuses non-toxic examples that mention the identity with toxic examples that do not.
- BNSP AUC: Background negative, subgroup positive. A low value here means that the model confuses toxic examples that mention the identity with non-toxic examples that do not.
The final score used in this competition is a combination of these bias metrics, which we will also compute.
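A hedged sketch of these per-identity AUCs with scikit-learn, assuming boolean numpy arrays y_true (toxic or not) and in_subgroup (comment mentions the identity) plus model scores y_pred:
# per-identity bias AUCs (illustrative sketch)
import numpy as np
from sklearn.metrics import roc_auc_score
def subgroup_auc(y_true, y_pred, in_subgroup):
    # AUC restricted to comments that mention the identity
    return roc_auc_score(y_true[in_subgroup], y_pred[in_subgroup])
def bpsn_auc(y_true, y_pred, in_subgroup):
    # background positive, subgroup negative:
    # non-toxic comments that mention the identity vs toxic comments that do not
    mask = (in_subgroup & ~y_true) | (~in_subgroup & y_true)
    return roc_auc_score(y_true[mask], y_pred[mask])
def bnsp_auc(y_true, y_pred, in_subgroup):
    # background negative, subgroup positive:
    # toxic comments that mention the identity vs non-toxic comments that do not
    mask = (in_subgroup & y_true) | (~in_subgroup & ~y_true)
    return roc_auc_score(y_true[mask], y_pred[mask])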
No Pretrained Embeddings - Final Metric: 0.90
GloVe - Final Metric: 0.9230
FastText - Final Metric: 0.9228
GloVe and FastText concatenated - Final Metric: 0.9234
Model Interpretation - Named Entity Recognition, Eli5
NER
displacy.render(nlp(str(sentence)), jupyter=True, style='ent')
TextExplainer
Let's use ELI5 to see how the model makes predictions.