Deploy Sentiment Analyzer Using SageMaker and W&B
Introduction
In this report we will use AWS SageMaker to train, tune, and deploy a sentiment classifier trained on the IMDB movie reviews dataset. We will use Weights & Biases to track our model architectures, compare model performance, and find the best set of hyperparameters.
AWS SageMaker
Amazon SageMaker is a fully-managed service that enables data scientists and developers to quickly and easily build, train, and deploy machine learning models at any scale. Amazon SageMaker includes modules that can be used together or independently to build, train, and deploy your machine learning models.
Using W&B with SageMaker
Authentication
W&B looks for a file named secrets.env relative to the training script and loads it into the environment when wandb.init() is called. You can generate a secrets.env file by calling wandb.sagemaker_auth(path="source_dir") in the script you use to launch your experiments. Be sure to add this file to your .gitignore!
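To make the mechanism concrete, here is a minimal sketch of what "loading secrets.env into the environment" amounts to. It is not W&B's actual implementation; the helper name load_env_file and the KEY=VALUE file layout are assumptions made for illustration, and the key is a placeholder.

```python
import os
import tempfile

def load_env_file(path, environ=None):
    """Read KEY=VALUE lines from `path` into `environ` (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    with open(path) as f:
        for line in f:
            line = line.strip()
            # Skip blank lines, comments, and lines without a key/value separator
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, value = line.split("=", 1)
            environ[key.strip()] = value.strip()
    return environ

# Demonstrate on a throwaway file with a placeholder key (not a real API key)
with tempfile.TemporaryDirectory() as tmp:
    env_path = os.path.join(tmp, "secrets.env")
    with open(env_path, "w") as f:
        f.write("# written by the launch script\nWANDB_API_KEY=placeholder-key\n")
    # A plain dict is used here so the demo does not touch the real environment
    demo_env = load_env_file(env_path, environ={})

print(demo_env)
```

Because the file holds a credential in plain text, keeping it out of version control is the important part.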
Existing Estimators
If you're using one of SageMaker's preconfigured estimators, you need to add a requirements.txt to your source directory that includes wandb:
wandb
If you're using an estimator that's running Python 2, you'll need to install psutil directly from a wheel before installing wandb:
https://wheels.galaxyproject.org/packages/psutil-5.4.8-cp27-cp27mu-manylinux1_x86_64.whl
wandb
Next up, we'll walk through a detailed example of using SageMaker with W&B.
Step 1. Download and Pre-process the Data
Downloading the Dataset
Let us begin by downloading the IMDB dataset from an AWS SageMaker notebook instance. The dataset contains 25,000 highly polar movie reviews for training and 25,000 for testing, all taken from imdb.com. Each review is labeled 'positive' if the reviewer enjoyed the film and 'negative' otherwise.
%mkdir ../data
!wget -O ../data/aclImdb_v1.tar.gz http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -zxf ../data/aclImdb_v1.tar.gz -C ../data
Preparing the Data
Now, let us read each of the reviews and combine them into a single input structure, keeping the dataset's training and testing splits separate.
import glob
import os

def read_imdb_data(data_dir='../data/aclImdb'):
    data = {}
    labels = {}
    for data_type in ['train', 'test']:
        data[data_type] = {}
        labels[data_type] = {}
        for sentiment in ['pos', 'neg']:
            data[data_type][sentiment] = []
            labels[data_type][sentiment] = []
            path = os.path.join(data_dir, data_type, sentiment, '*.txt')
            files = glob.glob(path)
            for f in files:
                with open(f) as review:
                    data[data_type][sentiment].append(review.read())
                    # A positive review is labeled 1 and a negative review 0
                    labels[data_type][sentiment].append(1 if sentiment == 'pos' else 0)
    return data, labels

data, labels = read_imdb_data()
Now that we have read the raw data into training and testing sets, the next step is to combine the positive and negative reviews within each set and shuffle the result.
from sklearn.utils import shuffle

def prepare_imdb_data(data, labels):
    # Combine positive and negative reviews and labels
    data_train = data['train']['pos'] + data['train']['neg']
    data_test = data['test']['pos'] + data['test']['neg']
    labels_train = labels['train']['pos'] + labels['train']['neg']
    labels_test = labels['test']['pos'] + labels['test']['neg']
    # Shuffle reviews and labels together so they stay aligned
    data_train, labels_train = shuffle(data_train, labels_train)
    data_test, labels_test = shuffle(data_test, labels_test)
    return data_train, data_test, labels_train, labels_test

train_X, test_X, train_y, test_y = prepare_imdb_data(data, labels)
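Shuffling reviews and labels in a single call matters because each label must stay attached to its review. The sketch below shows the same idea using only the standard library; shuffle_together is a hypothetical stand-in written for illustration, and the sample reviews are made up.

```python
import random

def shuffle_together(items, labels, seed=None):
    """Shuffle two equal-length lists with the same permutation (hypothetical helper)."""
    if len(items) != len(labels):
        raise ValueError("items and labels must have the same length")
    paired = list(zip(items, labels))
    random.Random(seed).shuffle(paired)
    shuffled_items = [item for item, _ in paired]
    shuffled_labels = [label for _, label in paired]
    return shuffled_items, shuffled_labels

reviews = ["loved it", "hated it", "great film", "dull plot"]
sentiments = [1, 0, 1, 0]
original_pairs = set(zip(reviews, sentiments))

shuffled_reviews, shuffled_sentiments = shuffle_together(reviews, sentiments, seed=0)

# Every review still carries its own label after shuffling
pairs_preserved = set(zip(shuffled_reviews, shuffled_sentiments)) == original_pairs
print(pairs_preserved)
```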
Now we have our train and test set. Let us have a look at an example data point.
print(train_X[100])
I got this film from a private collector and was very curious about it. It had a 7,8 in IMDb (9 votes only) and some
external comments were pleasant...
As we can see, our example input contains noise that we need to remove, such as HTML tags and punctuation, which carry no sentiment. We also wish to tokenize and stem our input. That way, words such as entertained and entertaining are considered the same with regard to sentiment analysis.
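import re

import nltk
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

def review_to_words(review):
    nltk.download("stopwords", quiet=True)
    stemmer = PorterStemmer()
    stop_words = set(stopwords.words("english"))  # Build the set once for fast lookups
    text = BeautifulSoup(review, "html.parser").get_text()  # Remove HTML tags
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())  # Lower-case and replace non-alphanumerics with spaces
    words = text.split()  # Split string into words
    words = [w for w in words if w not in stop_words]  # Remove stopwords
    words = [stemmer.stem(w) for w in words]  # Stem each word
    return words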
Let us test the above function on the previous example.
print(review_to_words(train_X[100]))
['got', 'film', 'privat', 'collector', 'curiou', '7', '8', 'imdb', '9', 'vote', 'extern', 'comment', 'pleasant',
'say', 'usual', 'uninterest', 'giallo', 'ye', 'great', 'cinematographi', 'film', 'well', 'direct', 'never',
'freak', 'start', 'well', 'although', 'bore', 'meat', 'stori', 'ordinari', 'thing', 'occur', 'normal', 'like',
'much', 'make', 'laugh', 'see', 'littl', 'tit', 'like', 'kind', 'giallo', 'like', 'bizarr', 'surreal', 'nonsens',
'gori', 'atmospher', 'brutal', 'murder', 'wont', 'appreci', 'much', 'film', 'give', '4', 'good', 'direct',
'edit', 'final', 'twist', 'make', 'film', 'entertain', 'could', 'much', 'better', '0']
The method below applies the review_to_words method to each of the reviews in the training and testing datasets. In addition, it caches the results, because this processing step can take a long time.
import os
import pickle

# Assumed cache location (not specified in this report); adjust to suit your environment
cache_dir = os.path.join("..", "cache")
os.makedirs(cache_dir, exist_ok=True)

def preprocess_data(data_train, data_test, labels_train, labels_test,
                    cache_dir=cache_dir, cache_file="preprocessed_data.pkl"):
    cache_data = None
    if cache_file is not None:
        try:
            with open(os.path.join(cache_dir, cache_file), "rb") as f:
                cache_data = pickle.load(f)
            print("Read preprocessed data from cache file:", cache_file)
        except (OSError, EOFError, pickle.UnpicklingError):
            pass  # Unable to read from cache, but that's okay
    # If cache is missing, then do the heavy lifting
    if cache_data is None:
        words_train = [review_to_words(review) for review in data_train]
        words_test = [review_to_words(review) for review in data_test]
        if cache_file is not None:
            cache_data = dict(words_train=words_train, words_test=words_test,
                              labels_train=labels_train, labels_test=labels_test)
            with open(os.path.join(cache_dir, cache_file), "wb") as f:
                pickle.dump(cache_data, f)
    else:
        # Unpack data loaded from cache file
        words_train, words_test, labels_train, labels_test = (
            cache_data['words_train'], cache_data['words_test'],
            cache_data['labels_train'], cache_data['labels_test'])
    return words_train, words_test, labels_train, labels_test

# Pass the shuffled reviews and labels produced by prepare_imdb_data
train_X, test_X, train_y, test_y = preprocess_data(train_X, test_X, train_y, test_y)
Now we have a pickle file that holds the processed data. However, we are not done yet: the model needs numeric input, so the next step is to convert each sequence of words into a sequence of integers by building a word dictionary. Here we fix the size of our vocabulary (including the 'no word' and 'infrequent' categories) at 5000, but you may wish to change this to see how it affects the model. We only need to construct a mapping for 4998 words, since indices 0 and 1 are reserved for 'no word' and 'infrequent word' respectively.
import numpy as np

def build_dict(data, vocab_size=5000):
    word_count = {}  # A dict storing the words that appear in the reviews along with how often they occur
    for review in data:
        for word in review:
            word_count[word] = word_count[word] + 1 if word in word_count else 1
    word_count_sorted = sorted(word_count.items(), key=(lambda item: item[1]), reverse=True)
    sorted_words = [item[0] for item in word_count_sorted]
    word_dict = {}  # This is what we are building, a dictionary that translates words into integers
    # The -2 is so that we save room for the 'no word' and 'infrequent' labels
    for idx, word in enumerate(sorted_words[:vocab_size - 2]):
        word_dict[word] = idx + 2
    return word_dict

word_dict = build_dict(train_X)
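To see the index layout on something small, the sketch below repeats the idea on a toy corpus with a vocabulary size of 4, so only the two most frequent words receive their own index. It uses collections.Counter instead of a manual counting loop; build_toy_dict and the sample reviews are made up for illustration.

```python
from collections import Counter

def build_toy_dict(reviews, vocab_size):
    """Map the most frequent words to indices starting at 2 (0 and 1 are reserved)."""
    counts = Counter(word for review in reviews for word in review)
    # most_common returns words in order of descending frequency
    most_frequent = [word for word, _ in counts.most_common(vocab_size - 2)]
    return {word: idx + 2 for idx, word in enumerate(most_frequent)}

toy_reviews = [
    ["film", "great", "film"],
    ["film", "bore"],
    ["great", "stori"],
]
toy_dict = build_toy_dict(toy_reviews, vocab_size=4)
print(toy_dict)
```

Here "film" (3 occurrences) and "great" (2 occurrences) get indices 2 and 3, while "bore" and "stori" fall outside the vocabulary and would later be encoded as the 'infrequent' index 1.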
Now that we have our word dictionary which allows us to transform the words appearing in the reviews into integers, it is time to make use of it and convert our reviews to their integer sequence representation, making sure to pad or truncate to a fixed length, which in our case is 500.
def convert_and_pad(word_dict, sentence, pad=500):
    NOWORD = 0  # We will use 0 to represent the 'no word' category
    INFREQ = 1  # and we use 1 to represent unseen words
    working_sentence = [NOWORD] * pad
    for word_index, word in enumerate(sentence[:pad]):
        if word in word_dict:
            working_sentence[word_index] = word_dict[word]
        else:
            working_sentence[word_index] = INFREQ
    return working_sentence, min(len(sentence), pad)

def convert_and_pad_data(word_dict, data, pad=500):
    result = []
    lengths = []
    for sentence in data:
        converted, leng = convert_and_pad(word_dict, sentence, pad)
        result.append(converted)
        lengths.append(leng)
    return np.array(result), np.array(lengths)

train_X, train_X_len = convert_and_pad_data(word_dict, train_X)
test_X, test_X_len = convert_and_pad_data(word_dict, test_X)
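A small example makes the padding and truncation behavior easier to see. The sketch below re-implements the conversion for a toy dictionary with a pad length of 5 instead of 500; the function name, dictionary, and sentences are illustrative only.

```python
NOWORD = 0  # padding, the 'no word' category
INFREQ = 1  # any word that is not in the dictionary

def to_padded_ids(word_dict, sentence, pad):
    """Convert words to integers, truncating or zero-padding to exactly `pad` entries."""
    ids = [word_dict.get(word, INFREQ) for word in sentence[:pad]]
    length = len(ids)
    return ids + [NOWORD] * (pad - length), length

toy_dict = {"film": 2, "great": 3}

# Shorter than pad: unknown word becomes 1, the rest is filled with 0
short_ids, short_len = to_padded_ids(toy_dict, ["great", "film", "twist"], pad=5)
# Longer than pad: everything after the fifth word is dropped
long_ids, long_len = to_padded_ids(toy_dict, ["film"] * 8, pad=5)

print(short_ids, short_len)
print(long_ids, long_len)
```

The returned length is what later lets the model ignore the padded positions.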
Step 2. Upload Data to S3 Bucket
We will need to upload the training dataset to an Amazon S3 bucket for our training code to access it. For now, we will save it locally, and we will upload it to S3 later. It is essential to note the format of the data that we are saving as we will need to know it when we write the training code. In our case, each row of the dataset has the form label, length, review[500] where review[500] is a sequence of 500 integers representing the words in the review. Let us first save the data locally in a CSV file.
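import os

import pandas as pd

# data_dir is assumed to be an existing local directory; its contents are uploaded to S3 in the next step
pd.concat([pd.DataFrame(train_y), pd.DataFrame(train_X_len), pd.DataFrame(train_X)], axis=1) \
    .to_csv(os.path.join(data_dir, 'train.csv'), header=False, index=False)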
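Since the training script has to read this layout back, it helps to see one row taken apart. The sketch below builds a single CSV row in memory, with a review length of 4 instead of 500 to keep it readable, and splits it into label, length, and review. The values are made up.

```python
import csv
import io

PAD = 4  # stands in for the 500 used in this report

# One row in the form: label, length, review[PAD]
csv_text = "1,3,12,7,45,0\n"

row = next(csv.reader(io.StringIO(csv_text)))
values = [int(v) for v in row]

label = values[0]    # 1 = positive, 0 = negative
length = values[1]   # number of real (non-padding) words
review = values[2:]  # PAD integers, zero-padded on the right

print(label, length, review)
```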
Uploading the Training Data
Next, we need to upload the training data to the SageMaker default S3 bucket to provide access to it while training our model.
import sagemaker

sagemaker_session = sagemaker.Session()
bucket = sagemaker_session.default_bucket()
prefix = 'sagemaker/sentiment_rnn'
role = sagemaker.get_execution_role()
input_data = sagemaker_session.upload_data(path=data_dir, bucket=bucket, key_prefix=prefix)
Step 3. Tune Hyper-Parameters of a Chosen Model
A SageMaker model is composed of three objects, each of which interacts with the others:
- Model Artifacts
- Training Code
- Inference Code
We will start by implementing our own neural network in PyTorch along with a training script. Here's our LSTM network.
import torch.nn as nn

class LSTMClassifier(nn.Module):
    def __init__(self, embedding_dim, hidden_dim, vocab_size):
        super(LSTMClassifier, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)
        self.dense = nn.Linear(in_features=hidden_dim, out_features=1)
        self.sig = nn.Sigmoid()
        self.word_dict = None

    def forward(self, x):
        x = x.t()  # Transpose so that each column is one review
        lengths = x[0, :]  # First row holds the true review lengths
        reviews = x[1:, :]  # Remaining rows hold the word indices
        embeds = self.embedding(reviews)
        lstm_out, _ = self.lstm(embeds)
        out = self.dense(lstm_out)
        # Select the output at the last real (non-padded) time step of each review
        out = out[lengths - 1, range(len(lengths))]
        return self.sig(out.squeeze())
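The indexing step out[lengths - 1, range(len(lengths))] is the least obvious line in forward: for each review in the batch, it picks the output at the last real time step and skips the padded positions. The sketch below mirrors that selection with plain Python lists instead of tensors; the numbers are arbitrary.

```python
# outputs[t][b] is the network output at time step t for review b
# (3 time steps, batch of 2 reviews)
outputs = [
    [0.10, 0.20],  # t = 0
    [0.30, 0.40],  # t = 1
    [0.50, 0.60],  # t = 2
]
lengths = [2, 3]  # review 0 has 2 real words, review 1 has 3

# Plain-list equivalent of out[lengths - 1, range(len(lengths))]
last_outputs = [outputs[lengths[b] - 1][b] for b in range(len(lengths))]
print(last_outputs)
```

Review 0 contributes its output at t = 1 and review 1 its output at t = 2, so padding after the end of a short review never influences the prediction.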
The training code is in the file train.py of the repo, but let us look at the snippet containing the training loop.
W&B integration
We will use W&B to log the metrics and visualizations of our runs that will provide valuable insights for hyper-parameter tuning. Using W&B with SageMaker is quite straightforward:
- Import
- Authenticate
- Log your metrics
You can authenticate via the command line or by creating a secrets.env file in source_dir containing your W&B API key. Because train.py is the entry point for SageMaker's HyperparameterTuner job, we'll need to authenticate via the secrets.env file. You can read more on authentication in the docs.
import wandb

wandb.login()  # This will look for the WANDB_API_KEY env variable provided by secrets.env

'''
Helper Functions and Arg Parsers
'''

args = parser.parse_args()

# Set defaults if we don't have values from SageMaker
print('Training start')
config_defaults = {
    'epochs': 10,
    'batch_size': 128,
    'lr': 0.01,
    'vocab_size': 5000
}
wandb.init(project='experiment_name', config=config_defaults)
config = wandb.config
def train(model, train_loader, epochs, optimizer, loss_fn, device):
    for epoch in range(1, epochs + 1):
        model.train()
        total_loss = 0
        for batch in train_loader:
            batch_X, batch_y = batch
            batch_X = batch_X.to(device)
            batch_y = batch_y.to(device)
            optimizer.zero_grad()
            y_hat = model(batch_X)
            loss = loss_fn(y_hat, batch_y)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
            # Log the loss of every batch to W&B
            wandb.log({"Batch Loss": loss.item()})
        # Log the average loss of the epoch to W&B and print it to the console
        wandb.log({"BCE Loss": total_loss / len(train_loader)})
        print("Epoch: {}, BCE Loss: {}".format(epoch, total_loss / len(train_loader)))
Hyperparameter Tuning
When a PyTorch model is constructed in SageMaker, an entry point must be specified. This is the Python file that is executed when the model is trained. The train directory contains a file called train.py, which holds the code needed to train our model.
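from sagemaker.pytorch import PyTorch
from sagemaker.tuner import CategoricalParameter, HyperparameterTuner

# train_instance_count and train_instance_type are the parameter names used by version 1
# of the SageMaker Python SDK; version 2 renamed them to instance_count and instance_type.
estimator = PyTorch(entry_point="train.py",
                    source_dir=os.getcwd() + "/train",
                    role=role,
                    framework_version='0.4.0',
                    train_instance_count=1,
                    train_instance_type='ml.p2.xlarge')

# Here we're only considering `learning rate` and `batch size` for tuning but `vocab size`
# and `string length` are equally important hyper-parameters.
hyperparameter_ranges = {
    'lr': CategoricalParameter([0.0001, 0.001, 0.01]),
    'batch_size': CategoricalParameter([128, 256, 512]),
}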
Let us now define the objective metric to be optimized by the tuner. We describe the metric with a regular expression; the tuner looks for lines matching this pattern in the console logs of the training process, which is why the training loop prints the BCE Loss after every epoch. The best training job will be chosen based on this metric.
objective_metric_name = 'BCE Loss'
objective_type = 'Minimize'
metric_definitions = [{'Name': 'BCE Loss',
                       'Regex': 'BCE Loss: ([0-9\\.]+)'}]
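It is worth checking the regular expression against the line that the training loop prints, since the tuner can only read the metric if the pattern matches. The sketch below applies the same pattern to a sample log line; the epoch number and loss value are invented.

```python
import re

# Same pattern as in metric_definitions
pattern = 'BCE Loss: ([0-9\\.]+)'

# A line in the format printed by the training loop (values are made up)
log_line = "Epoch: {}, BCE Loss: {}".format(3, 0.4312)

match = re.search(pattern, log_line)
captured = float(match.group(1))
print(captured)
```

Note that the character class only covers digits and dots, so a value printed in scientific notation, such as 1e-05, would be captured only up to the e.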
Now let us put it all together and deploy a HyperparameterTuner.
tuner = HyperparameterTuner(estimator,
                            objective_metric_name,
                            hyperparameter_ranges,
                            metric_definitions,
                            max_jobs=9,
                            max_parallel_jobs=1,
                            objective_type=objective_type)
tuner.fit({'training': input_data})
This will launch a tuning job that appears in the SageMaker dashboard. There are 9 possible hyperparameter combinations in this case (3 learning rates × 3 batch sizes). So, if everything goes right, we'll have a job summary in the hyperparameter tuning jobs section of the SageMaker dashboard.
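The count of 9 comes from pairing every learning rate with every batch size. The sketch below simply enumerates those pairs; it does not reproduce how the tuner decides which combination to run next.

```python
from itertools import product

learning_rates = [0.0001, 0.001, 0.01]
batch_sizes = [128, 256, 512]

# Every (learning rate, batch size) pair in the search space
combinations = list(product(learning_rates, batch_sizes))
for lr, batch_size in combinations:
    print("lr={}, batch_size={}".format(lr, batch_size))

print(len(combinations))
```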
Now that we have completed the tuning job, let us go to the W&B dashboard of this project to see the logged runs during the training process.
Step 4. Deploy and Test the Model
Deploying a model on AWS SageMaker requires the following steps:
- Write Inference Code
- Deploy the estimator to an endpoint
- Set up a Lambda function
- Set up API Gateway
- Deploy the model on the web app
- (Optional) Delete the endpoint after you're done
All of these steps are explained in detail in the repository notebook. Once you're done with these steps, go to the website folder in the repo. There should be a file called index.html. Download the file to your computer and open it in a text editor of your choice. There should be a line that contains REPLACE WITH PUBLIC API URL. Replace that text with the URL of the API you just built and open the file in the browser.
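For orientation, the Lambda function that sits between API Gateway and the endpoint could look roughly like the sketch below. This is not the code from the repository notebook. The endpoint name, the text/plain content type, and the assumption that the endpoint returns its prediction as a plain string are all hypothetical, and the stub classes exist only so the handler can be exercised without an AWS account.

```python
import os

# Hypothetical endpoint name; use the name of the endpoint you deployed
ENDPOINT_NAME = os.environ.get("ENDPOINT_NAME", "sentiment-endpoint")

def predict(review_text, runtime_client, endpoint_name=ENDPOINT_NAME):
    """Send the raw review text to the endpoint and return the response body as a string."""
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/plain",  # assumed input format of the inference code
        Body=review_text.encode("utf-8"),
    )
    return response["Body"].read().decode("utf-8")

def lambda_handler(event, context, runtime_client=None):
    if runtime_client is None:
        import boto3  # imported lazily so the sketch also runs where boto3 is absent
        runtime_client = boto3.client("sagemaker-runtime")
    result = predict(event["body"], runtime_client)
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "text/plain", "Access-Control-Allow-Origin": "*"},
        "body": result,
    }

# Local dry run with stand-ins for the AWS client (no network access needed)
class StubBody:
    def read(self):
        return b"1"  # pretend the model predicted 'positive'

class StubClient:
    def invoke_endpoint(self, EndpointName, ContentType, Body):
        self.last_body = Body
        return {"Body": StubBody()}

stub_client = StubClient()
dry_run = lambda_handler({"body": "great film"}, None, runtime_client=stub_client)
print(dry_run)
```

The Access-Control-Allow-Origin header is included on the assumption that index.html is opened from a different origin than the API.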
Now, you've successfully deployed a sentiment analyzer.
Here are some example runs.