
Predicting Semantic Similarity using KerasNLP

This report demonstrates how to use KerasNLP to predict semantic similarity.
Semantic similarity is the task of determining the degree to which two sentences agree in meaning. This example uses the SNLI (Stanford Natural Language Inference) corpus (https://nlp.stanford.edu/projects/snli/) to predict sentence semantic similarity. We will learn how to use the KerasNLP library, an extension of the core Keras API, for this task, and we will see how KerasNLP effectively reduces boilerplate code and simplifies the process of building and using models. For more information on KerasNLP, please refer to KerasNLP's official documentation.
The code in this report is available as a Kaggle Notebook.


Getting Started with KerasNLP

Let's first install keras-nlp:
!pip install -q keras-nlp

Next up, let's import all the necessary libraries and set the backend framework for Keras (KerasNLP supports the new Keras Core out-of-the-box).
import os

os.environ["KERAS_BACKEND"] = "jax" # or "tensorflow" or "torch"

import numpy as np
import tensorflow as tf
import keras_core as keras
import keras_nlp
import tensorflow_datasets as tfds
import wandb
from wandb.keras import WandbMetricsLogger

PROJECT_NAME = "semantic-similarity-with-keras-nlp"
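If you're running in a fresh environment, you may first need to authenticate with W&B:

# Log in to Weights & Biases (prompts for an API key if one isn't configured).
wandb.login()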


Overview of the SNLI Dataset

To load the SNLI dataset, we use the tensorflow-datasets library. The full dataset contains over 550,000 samples; to ensure that this example runs quickly, we use only 20% of the training split. Every sample in the dataset contains three components (we'll peek at a raw record right after loading the data):
  1. hypothesis,
  2. premise,
  3. and label
The premise is the original caption provided to the author of the pair, while the hypothesis is the caption that author wrote in response. The label is assigned by annotators to indicate the similarity between the two sentences.
The SNLI dataset contains three possible similarity label values:
  1. Contradiction represents completely dissimilar sentences.
  2. Entailment denotes sentences with similar meaning.
  3. Neutral refers to sentence pairs where no clear similarity or dissimilarity can be established.
snli_train = tfds.load("snli", split="train[:20%]")
snli_val = tfds.load("snli", split="validation")
snli_test = tfds.load("snli", split="test")
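To make these fields concrete, we can print a single raw record. This is a minimal sketch; the exact sample you get may differ, and the comments describe the field types rather than specific values:

# Peek at one raw training record to see the three fields.
sample = next(iter(snli_train))
print(sample["premise"])     # byte-string tensor: the original caption
print(sample["hypothesis"])  # byte-string tensor: the author-written caption
print(sample["label"])       # int64 tensor: 0, 1, 2, or -1 for unlabeled pairs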

Preprocessing

Some samples in the dataset have missing or incorrectly labeled data, which is denoted by a label value of -1. To ensure the accuracy and reliability of our model, we simply filter these samples out of the dataset.

def filter_labels(sample):
    return sample["label"] >= 0


def split_labels(sample):
    x = (sample["hypothesis"], sample["premise"])
    y = sample["label"]
    return x, y


train_ds = (
    snli_train.filter(filter_labels)
    .map(split_labels, num_parallel_calls=tf.data.AUTOTUNE)
)

val_ds = (
    snli_val.filter(filter_labels)
    .map(split_labels, num_parallel_calls=tf.data.AUTOTUNE)
)

test_ds = (
    snli_test.filter(filter_labels)
    .map(split_labels, num_parallel_calls=tf.data.AUTOTUNE)
)


def get_batched_dataset(batch_size):
    train_set = train_ds.batch(batch_size)
    val_set = val_ds.batch(batch_size)
    test_set = test_ds.batch(batch_size)
    return train_set, val_set, test_set
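As an optional tweak that isn't part of the pipeline above, chaining .prefetch(tf.data.AUTOTUNE) onto each batched split lets tf.data prepare the next batch while the accelerator is busy. A sketch, using a hypothetical helper name:

def get_batched_prefetched_dataset(batch_size):
    # Same batching as above, with prefetch added to overlap input
    # preparation with training.
    train_set = train_ds.batch(batch_size).prefetch(tf.data.AUTOTUNE)
    val_set = val_ds.batch(batch_size).prefetch(tf.data.AUTOTUNE)
    test_set = test_ds.batch(batch_size).prefetch(tf.data.AUTOTUNE)
    return train_set, val_set, test_set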



Establishing a Semantic Similarity Baseline with BERT

We use the BERT model from KerasNLP to establish a baseline for our semantic similarity task. The keras_nlp.models.BertClassifier class attaches a classification head to the BERT Backbone, mapping the backbone outputs to a logit output suitable for a classification task. This significantly reduces the need for custom code.
KerasNLP models have built-in tokenization capabilities that are handled by default based on the selected preset, though users can also plug in custom preprocessing for their specific needs. If we pass a tuple as input, the model tokenizes all the strings and concatenates them with a "[SEP]" separator. We instantiate the model with pretrained weights via the from_preset() method, which can also accept a custom preprocessor. For the SNLI dataset, we set num_classes to 3.
KerasNLP task models come with sensible compilation defaults, so we can train the model we just instantiated simply by calling the fit() method.
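To see what that built-in preprocessing produces for a sentence pair, here is a small sketch using the matching preprocessor preset (the example sentences are made up, and the exact token ids depend on the vocabulary):

# A small sketch of the built-in preprocessing for a sentence pair.
preprocessor = keras_nlp.models.BertPreprocessor.from_preset("bert_tiny_en_uncased")
features = preprocessor(
    ("A person is training his horse.", "A person is outdoors with a horse.")
)
# The two segments are packed into one sequence with [CLS]/[SEP] markers;
# segment_ids marks which tokens belong to which sentence.
print(features["token_ids"][:16])
print(features["segment_ids"][:16])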


with wandb.init(project=PROJECT_NAME, name="baseline") as run:
bert_classifier = keras_nlp.models.BertClassifier.from_preset(
"bert_tiny_en_uncased", num_classes=3
)
train_set, val_set, test_set = get_batched_dataset(512)
bert_classifier.fit(train_set, validation_data=val_set, epochs=1, callbacks=[WandbMetricsLogger(log_freq="batch")])
bert_classifier.evaluate(test_set, callbacks=[WandbMetricsLogger(log_freq="batch")])
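Once trained, the classifier can score new sentence pairs directly. A minimal sketch, using a made-up example pair (in the tfds encoding of SNLI, label 0 is entailment, 1 is neutral, and 2 is contradiction):

# Score a new (hypothesis, premise) pair with the trained baseline.
sample = (
    tf.constant(["A man is playing an instrument."]),     # hypothesis
    tf.constant(["A man is playing a guitar on stage."]), # premise
)
logits = bert_classifier.predict(sample)
probs = keras.ops.softmax(logits)  # 0=entailment, 1=neutral, 2=contradiction
print(probs)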





Experimenting with Learning Rates
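Our baseline trained with the classifier's default compilation settings. Let's see whether recompiling with a hand-picked learning rate, Adam at 5e-5, improves on it.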


with wandb.init(project=PROJECT_NAME, name="change-lr-bs") as run:
    bert_classifier = keras_nlp.models.BertClassifier.from_preset(
        "bert_tiny_en_uncased", num_classes=3
    )
    bert_classifier.compile(
        loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        optimizer=keras.optimizers.Adam(5e-5),
        metrics=[keras.metrics.SparseCategoricalAccuracy()],
    )

    train_set, val_set, test_set = get_batched_dataset(512)

    bert_classifier.fit(
        train_set,
        validation_data=val_set,
        epochs=1,
        callbacks=[WandbMetricsLogger(log_freq="batch")],
    )

    bert_classifier.evaluate(test_set, callbacks=[WandbMetricsLogger(log_freq="batch")])



Deploying a Learning Rate Scheduler


class TriangularSchedule(keras.optimizers.schedules.LearningRateSchedule):
    """Linear ramp up for `warmup` steps, then linear decay to zero at `total` steps."""

    def __init__(self, rate, warmup, total):
        super().__init__()
        self.rate = rate
        self.warmup = warmup
        self.total = total

    def get_config(self):
        config = {"rate": self.rate, "warmup": self.warmup, "total": self.total}
        return config

    def __call__(self, step):
        step = keras.ops.cast(step, dtype="float32")
        rate = keras.ops.cast(self.rate, dtype="float32")
        warmup = keras.ops.cast(self.warmup, dtype="float32")
        total = keras.ops.cast(self.total, dtype="float32")

        warmup_rate = rate * step / warmup
        cooldown_rate = rate * (total - step) / (total - warmup)
        triangular_rate = keras.ops.minimum(warmup_rate, cooldown_rate)
        return keras.ops.maximum(triangular_rate, 0.0)
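Before using the schedule, we can sanity check its shape by evaluating it at a few steps (the rate, warmup, and total values below are arbitrary, chosen just for illustration):

# The rate ramps up to `rate` at step 100, then decays linearly to zero at 500.
schedule = TriangularSchedule(rate=1e-4, warmup=100, total=500)
for step in [0, 50, 100, 300, 500]:
    print(step, float(schedule(step)))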

Now we will use the triangular learning rate schedule we just defined.

with wandb.init(project=PROJECT_NAME, name="lr-schedule") as run:
    bert_classifier = keras_nlp.models.BertClassifier.from_preset(
        "bert_tiny_en_uncased", num_classes=3
    )

    train_set, val_set, test_set = get_batched_dataset(512)
    # Get the total count of training batches by walking the batched dataset;
    # this is necessary because filtering out the -1 labels changes its length.
    epochs = 3
    total_steps = sum(1 for _ in train_set.as_numpy_iterator()) * epochs
    warmup_steps = int(total_steps * 0.2)

    bert_classifier.compile(
        loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        optimizer=keras.optimizers.AdamW(
            TriangularSchedule(1e-4, warmup_steps, total_steps)
        ),
        metrics=[keras.metrics.SparseCategoricalAccuracy()],
    )

    bert_classifier.fit(
        train_set,
        validation_data=val_set,
        epochs=epochs,
        callbacks=[WandbMetricsLogger(log_freq="batch")],
    )
    bert_classifier.evaluate(test_set, callbacks=[WandbMetricsLogger(log_freq="batch")])




Choosing the Right Hyperparameters for the Model with Sweeps!

Hyperparameter sweeps help us find the best combination of hyperparameter values for an ML model on a specific dataset by orchestrating runs and managing the configuration of our training code.
Let's first define our sweep config.
import wandb

sweep_config = {
    'project': PROJECT_NAME,
    'method': 'random',
    'run_cap': 6,
    'metric': {
        'name': 'accuracy',
        'goal': 'maximize'
    },
    'parameters': {
        'learning_rate': {
            'values': [5e-6, 2e-5, 5e-5, 1e-4]
        },
        'batch_size': {
            'values': [64, 128, 256, 512]
        }
    }
}

sweep_defaults = {
    'learning_rate': 5e-5,
    'batch_size': 512,
}


Next, let's define our training function for the sweep.
def train():
    wandb.init(project=PROJECT_NAME, config=sweep_defaults)
    bert_classifier = keras_nlp.models.BertClassifier.from_preset(
        "bert_tiny_en_uncased", num_classes=3
    )
    train_set, val_set, test_set = get_batched_dataset(wandb.config.batch_size)
    optimizer = keras.optimizers.AdamW(
        learning_rate=wandb.config.learning_rate, epsilon=1e-8
    )
    bert_classifier.compile(
        loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        optimizer=optimizer,
        metrics=[keras.metrics.SparseCategoricalAccuracy()],
    )
    bert_classifier.fit(
        train_set,
        validation_data=val_set,
        epochs=2,
        callbacks=[WandbMetricsLogger(log_freq="batch")],
    )
    bert_classifier.evaluate(test_set, callbacks=[WandbMetricsLogger(log_freq="batch")])

# Create the sweep from the config above, then launch an agent to run it.
sweep_id = wandb.sweep(sweep_config)
wandb.agent(sweep_id, function=train)



From the visualization above, we can see that with a batch size of 128 and a learning rate of 1e-4 we achieve an accuracy of 0.727. If we increase the batch size to 256 while keeping the learning rate constant, accuracy falls to 0.6235. In a different run, with a batch size of 64 and a learning rate of 2e-5, we achieve an accuracy of 0.674; keeping the learning rate the same and increasing the batch size to 256 drops accuracy to 0.4161. The general conclusion for this dataset and bert-tiny is that a higher learning rate combined with a smaller batch size yields higher accuracy.

Conclusion

Throughout this tutorial, we demonstrated how to use a pretrained BERT model to establish a baseline and improve performance by trying different training settings. The KerasNLP toolbox provides a range of modular building blocks for preprocessing text, including pretrained state-of-the-art models and low-level Transformer Encoder layers. We believe that this makes experimenting with natural language solutions more accessible and efficient.