Deploy Sentiment Analyzer Using SageMaker and W&B

Exploring wandb's SageMaker integration to train, tune and deploy a sentiment analysis API

Introduction

In this report, we will use AWS SageMaker to train, tune, and deploy our model: a sentiment classifier trained on the IMDB movie reviews dataset. We will track our model architectures, compare model performance, and find the best set of hyperparameters using Weights & Biases.

AWS SageMaker

Amazon SageMaker is a fully managed service that enables data scientists and developers to quickly and easily build, train, and deploy machine learning models at any scale. Its modules can be used together or independently.

Using W&B with SageMaker

Authentication

W&B looks for a file named secrets.env relative to the training script and loads it into the environment when wandb.init() is called. You can generate a secrets.env file by calling wandb.sagemaker_auth(path="source_dir") in the script you use to launch your experiments. Be sure to add this file to your .gitignore!
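
For example, a launcher notebook can generate the secrets file just before constructing the estimator. A minimal sketch, assuming your training script lives in a directory named source_dir:

import wandb

# Writes secrets.env (containing your W&B API key) into source_dir,
# so the training container can authenticate when wandb.init() is called
wandb.sagemaker_auth(path="source_dir")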

Existing Estimators

If you're using one of SageMaker's preconfigured estimators, you need to add a requirements.txt file to your source directory that includes wandb:

wandb

If you're using an estimator that's running Python 2, you'll need to install psutil directly from a wheel before installing wandb:

https://wheels.galaxyproject.org/packages/psutil-5.4.8-cp27-cp27mu-manylinux1_x86_64.whl
wandb

Next up, we'll walk through a detailed example of using SageMaker with W&B.

Step 1. Download and Pre-process the Data

Downloading the Dataset

Let us begin by downloading the IMDB dataset from an AWS SageMaker notebook instance. The dataset consists of 25,000 highly polar movie reviews for training and 25,000 for testing, collected from imdb.com; each review is labeled 'positive' if the reviewer enjoyed the film and 'negative' otherwise.

%mkdir ../data
!wget -O ../data/aclImdb_v1.tar.gz http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -zxf ../data/aclImdb_v1.tar.gz -C ../data

Preparing the Data

Now, let us read each of the reviews and combine them into a single input structure, keeping track of the train/test split and the positive/negative labels.

import os
import glob

def read_imdb_data(data_dir='../data/aclImdb'):
    data = {}
    labels = {}
    for data_type in ['train', 'test']:
        data[data_type] = {}
        labels[data_type] = {}
        for sentiment in ['pos', 'neg']:
            data[data_type][sentiment] = []
            labels[data_type][sentiment] = []
            path = os.path.join(data_dir, data_type, sentiment, '*.txt')
            files = glob.glob(path)
            for f in files:
                with open(f) as review:
                    data[data_type][sentiment].append(review.read())
                    # A positive review is labeled '1', a negative one '0'
                    labels[data_type][sentiment].append(1 if sentiment == 'pos' else 0)
    return data, labels

data, labels = read_imdb_data()

The raw data is already organized into train and test sets; the next step is to combine the positive and negative reviews within each set and shuffle the result.

from sklearn.utils import shuffle

def prepare_imdb_data(data, labels):
    # Combine positive and negative reviews and labels
    data_train = data['train']['pos'] + data['train']['neg']
    data_test = data['test']['pos'] + data['test']['neg']
    labels_train = labels['train']['pos'] + labels['train']['neg']
    labels_test = labels['test']['pos'] + labels['test']['neg']
    # Shuffle reviews and corresponding labels within each set
    data_train, labels_train = shuffle(data_train, labels_train)
    data_test, labels_test = shuffle(data_test, labels_test)
    return data_train, data_test, labels_train, labels_test

train_X, test_X, train_y, test_y = prepare_imdb_data(data, labels)

Now we have our train and test sets. Let us have a look at an example data point.

print(train_X[100])
I got this film from a private collector and was very curious about it. It had a 7,8 in IMDb (9 votes only) and some 
external comments were pleasant...

As we can see, our example input contains a lot of information that is useless for our task; in NLP tasks, HTML tags in particular carry no sentiment and need to be stripped out. Besides, we wish to stem our words, so that words such as entertained and entertaining are treated the same with regard to sentiment analysis.

import re

import nltk
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

def review_to_words(review):
    nltk.download("stopwords", quiet=True)
    stemmer = PorterStemmer()

    text = BeautifulSoup(review, "html.parser").get_text() # Remove HTML tags
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower()) # Lowercase and drop punctuation
    words = text.split() # Split string into words
    words = [w for w in words if w not in stopwords.words("english")] # Remove stopwords
    words = [stemmer.stem(w) for w in words] # Stem each remaining word

    return words

Let us test the above function on the previous example.

print(review_to_words(train_X[100]))

['got', 'film', 'privat', 'collector', 'curiou', '7', '8', 'imdb', '9', 'vote', 'extern', 'comment', 'pleasant',
 'say', 'usual', 'uninterest', 'giallo', 'ye', 'great', 'cinematographi', 'film', 'well', 'direct', 'never',
 'freak', 'start', 'well', 'although', 'bore', 'meat', 'stori', 'ordinari', 'thing', 'occur', 'normal', 'like',
 'much', 'make', 'laugh', 'see', 'littl', 'tit', 'like', 'kind', 'giallo', 'like', 'bizarr', 'surreal', 'nonsens',
 'gori', 'atmospher', 'brutal', 'murder', 'wont', 'appreci', 'much', 'film', 'give', '4', 'good', 'direct',
 'edit', 'final', 'twist', 'make', 'film', 'entertain', 'could', 'much', 'better', '0']

The function below applies review_to_words to each review in the training and testing datasets, and caches the results, since this processing step can take a long time.

import os
import pickle

cache_dir = os.path.join("../cache", "sentiment_analysis")  # where cached files are stored
os.makedirs(cache_dir, exist_ok=True)

def preprocess_data(data_train, data_test, labels_train, labels_test,
                    cache_dir=cache_dir, cache_file="preprocessed_data.pkl"):
    cache_data = None
    if cache_file is not None:
        try:
            with open(os.path.join(cache_dir, cache_file), "rb") as f:
                cache_data = pickle.load(f)
            print("Read preprocessed data from cache file:", cache_file)
        except FileNotFoundError:
            pass  # unable to read from cache, but that's okay
    # If the cache is missing, do the heavy lifting
    if cache_data is None:
        words_train = [review_to_words(review) for review in data_train]
        words_test = [review_to_words(review) for review in data_test]
        if cache_file is not None:
            cache_data = dict(words_train=words_train, words_test=words_test,
                              labels_train=labels_train, labels_test=labels_test)
            with open(os.path.join(cache_dir, cache_file), "wb") as f:
                pickle.dump(cache_data, f)
    else:
        # Unpack data loaded from the cache file
        words_train, words_test, labels_train, labels_test = (cache_data['words_train'],
                cache_data['words_test'], cache_data['labels_train'], cache_data['labels_test'])
    return words_train, words_test, labels_train, labels_test

train_X, test_X, train_y, test_y = preprocess_data(train_X, test_X, train_y, test_y)

Now we have a pickle file with the processed data. However, we are not entirely done yet: the model works on numbers, not words. So, the next step is to convert each sequence of words to a sequence of integers by building a word dictionary. Here we fix the size of our vocabulary (including the 'no word' and 'infrequent' categories) at 5000, but you may wish to change this to see how it affects the model. We only need to construct a mapping for 4998 words, since indices 0 and 1 are reserved for 'no word' and 'infrequent word' respectively.

import numpy as np

def build_dict(data, vocab_size=5000):
    word_count = {} # A dict mapping each word appearing in the reviews to how often it occurs
    for review in data:
        for word in review:
            word_count[word] = word_count[word] + 1 if word in word_count else 1
    # Sort words by frequency, most frequent first
    word_count_sorted = sorted(word_count.items(), key=(lambda item: item[1]), reverse=True)
    sorted_words = [item[0] for item in word_count_sorted]
    word_dict = {} # This is what we are building: a dictionary that translates words into integers
    # The -2 saves room for the 'no word' (0) and 'infrequent' (1) labels
    for idx, word in enumerate(sorted_words[:vocab_size - 2]):
        word_dict[word] = idx + 2
    return word_dict

word_dict = build_dict(train_X)

Now that we have a word dictionary to transform the words appearing in the reviews into integers, it is time to convert our reviews to their integer sequence representation, padding or truncating each to a fixed length, which in our case is 500.

def convert_and_pad(word_dict, sentence, pad=500):
    NOWORD = 0 # We will use 0 to represent the 'no word' category
    INFREQ = 1 # and we use 1 to represent unseen words
    working_sentence = [NOWORD] * pad
    for word_index, word in enumerate(sentence[:pad]):
        if word in word_dict:
            working_sentence[word_index] = word_dict[word]
        else:
            working_sentence[word_index] = INFREQ
    return working_sentence, min(len(sentence), pad)

def convert_and_pad_data(word_dict, data, pad=500):
    result = []
    lengths = []
    for sentence in data:
        converted, leng = convert_and_pad(word_dict, sentence, pad)
        result.append(converted)
        lengths.append(leng)
    return np.array(result), np.array(lengths)

train_X, train_X_len = convert_and_pad_data(word_dict, train_X)
test_X, test_X_len = convert_and_pad_data(word_dict, test_X)
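
As a quick sanity check, the resulting arrays should have one row per review (a minimal check, assuming the full 25,000-review training set and the default pad length of 500):

print(train_X.shape)      # (25000, 500) - one padded integer sequence per review
print(train_X_len.shape)  # (25000,) - the true (possibly truncated) length of each review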

Step 2. Upload Data to S3 Bucket

We will need to upload the training dataset to an Amazon S3 bucket for our training code to access it. For now, we will save it locally, and we will upload it to S3 later. It is essential to note the format of the data that we are saving as we will need to know it when we write the training code. In our case, each row of the dataset has the form label, length, review[500] where review[500] is a sequence of 500 integers representing the words in the review. Let us first save the data locally in a CSV file.

import pandas as pd
    
pd.concat([pd.DataFrame(train_y), pd.DataFrame(train_X_len), pd.DataFrame(train_X)], axis=1) \
        .to_csv(os.path.join(data_dir, 'train.csv'), header=False, index=False)

Uploading the Training Data

Next, we need to upload the training data to the SageMaker default S3 bucket to provide access to it while training our model.


import sagemaker

sagemaker_session = sagemaker.Session()
bucket = sagemaker_session.default_bucket()
prefix = 'sagemaker/sentiment_rnn'
role = sagemaker.get_execution_role()

input_data = sagemaker_session.upload_data(path=data_dir, bucket=bucket, key_prefix=prefix)

Step 3. Tune Hyper-Parameters of a Chosen Model

A SageMaker model is composed of 3 objects, each of which interacts with the others:

  • Model Artifacts
  • Training Code
  • Inference Code

We will start by implementing our own neural network in PyTorch along with a training script. Here's our LSTM network.

import torch.nn as nn

class LSTMClassifier(nn.Module):
    def __init__(self, embedding_dim, hidden_dim, vocab_size):
        super(LSTMClassifier, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)
        self.dense = nn.Linear(in_features=hidden_dim, out_features=1)
        self.sig = nn.Sigmoid()
        self.word_dict = None

    def forward(self, x):
        x = x.t()
        lengths = x[0, :]   # the first row holds each review's true length
        reviews = x[1:, :]  # the remaining rows hold the padded word indices
        embeds = self.embedding(reviews)
        lstm_out, _ = self.lstm(embeds)
        out = self.dense(lstm_out)
        # Take the output at the last real (non-padding) timestep of each review
        out = out[lengths - 1, range(len(lengths))]
        return self.sig(out.squeeze())

The training code is in the file train.py of the repo, but let us look at the snippet containing the training loop.

W&B Integration

We will use W&B to log metrics and visualizations from our runs, which will provide valuable insights for hyper-parameter tuning. Using W&B with SageMaker is quite straightforward:

  • Import
  • Authenticate
  • Log your metrics

You can authenticate via the command line or by creating a secrets.env file in source_dir containing your W&B API key. Since the training script runs non-interactively inside SageMaker's HyperparameterTuner job, we'll authenticate via the secrets.env file. You can read more on authentication in the docs.

import wandb
wandb.login() # This will look for the WANDB_API_KEY env variable provided by secrets.env

'''
Helper Functions and Arg Parsers
'''
args = parser.parse_args()

# Set defaults if we don't have values from SageMaker
print('Training start')
config_defaults = {
    'epochs': 10,
    'batch_size': 128,
    'lr': 0.01,
    'vocab_size': 5000
}
wandb.init(project='experiment_name', config=config_defaults)
config = wandb.config

def train(model, train_loader, epochs, optimizer, loss_fn, device):
    for epoch in range(1, epochs + 1):
        model.train()
        total_loss = 0
        for batch in train_loader:
            batch_X, batch_y = batch
            batch_X = batch_X.to(device)
            batch_y = batch_y.to(device)
            optimizer.zero_grad()
            y_hat = model(batch_X)
            loss = loss_fn(y_hat, batch_y)
            loss.backward()
            optimizer.step()
            total_loss += loss.data.item()
            wandb.log({"Batch Loss": loss.data.item()})
        # Log the average loss for this epoch
        wandb.log({"BCE Loss": total_loss / len(train_loader)})
        print("Epoch: {}, BCE Loss: {}".format(epoch, total_loss / len(train_loader)))

Hyperparameter Tuning

When a PyTorch model is constructed in SageMaker, an entry point must be specified. This is the Python file which will be executed when the model is trained. Inside the train directory is a file called train.py, which contains most of the necessary code to train our model.

from sagemaker.pytorch import PyTorch
from sagemaker.tuner import CategoricalParameter, HyperparameterTuner

estimator = PyTorch(entry_point="train.py",
                    source_dir=os.getcwd() + "/train",
                    role=role,
                    framework_version='0.4.0',
                    train_instance_count=1,
                    train_instance_type='ml.p2.xlarge')

# Here we're only considering `learning rate` and `batch size` for tuning, but `vocab size`
# and `string length` are equally important hyper-parameters.
hyperparameter_ranges = {
    'lr': CategoricalParameter([0.0001, 0.001, 0.01]),
    'batch_size': CategoricalParameter([128, 256, 512]),
}

Let us now define the objective metric to be optimized by the tuner. Here we'll define a Regex that describes our metric; the tuner will look for a matching pattern in the console logs of the training process. The best tuning job will be chosen based on this metric.

objective_metric_name = 'BCE Loss'
objective_type = 'Minimize'
metric_definitions = [{'Name': 'BCE Loss',
                       'Regex': 'BCE Loss: ([0-9\\.]+)'}]

Now let us put it all together and launch a HyperparameterTuner.

tuner = HyperparameterTuner(estimator,
                            objective_metric_name,
                            hyperparameter_ranges,
                            metric_definitions,
                            max_jobs=9,
                            max_parallel_jobs=1,
                            objective_type=objective_type)

tuner.fit({'training': input_data})

This will launch a tuning job visible in the SageMaker dashboard. There are 9 possible hyper-parameter combinations in this case (3 learning rates × 3 batch sizes), and max_jobs=9 covers them all. So, if everything goes right, we'll have a job summary in the hyperparameter tuning jobs section of the SageMaker dashboard.
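
Once the tuning job completes, you can also retrieve the winning job programmatically. A minimal sketch using the SageMaker SDK; best_training_job() returns the name of the job that minimized our objective metric:

best_job = tuner.best_training_job()
print(best_job)

# Re-attach an estimator to that job so it can be inspected or deployed later
best_estimator = PyTorch.attach(best_job)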

Now that we have completed the tuning job, let us go to the W&B dashboard of this project to see the logged runs during the training process.
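
If you'd rather query those runs programmatically than browse the dashboard, the public W&B API supports that too. A minimal sketch; 'my-entity/experiment_name' is a placeholder for your own entity and project path:

import wandb

api = wandb.Api()
runs = api.runs("my-entity/experiment_name")
for run in runs:
    print(run.name, run.config.get("lr"), run.summary.get("BCE Loss"))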






Step 4. Deploy and Test the Model

Deploying a model on AWS SageMaker requires the following steps:

  1. Write the inference code
  2. Deploy the estimator to an endpoint
  3. Set up a Lambda function
  4. Set up API Gateway
  5. Deploy the model on the web app
  6. (Optional) Delete the endpoint after you're done

All of these steps are explained in detail in the repository notebook. Once you're done with them, go to the website folder in the repo; there should be a file called index.html. Download it to your computer, open it in a text editor of your choice, and find the line that contains REPLACE WITH PUBLIC API URL. Replace that line with the URL of the API you just built and open the file in a browser.
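
As a minimal sketch of steps 2 and 6, assuming the estimator API used above (the full version, including the custom inference code from step 1, lives in the repository notebook, and the instance type here is an assumption):

# Step 2: deploy the trained estimator to a real-time endpoint
predictor = estimator.deploy(initial_instance_count=1,
                             instance_type='ml.m4.xlarge')

# Send one processed review to the endpoint as a quick sanity check
print(predictor.predict(test_X[0]))

# Step 6: delete the endpoint when you're done to avoid ongoing charges
predictor.delete_endpoint()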

Now, you've successfully deployed a sentiment analyzer!

Here are some example runs (demo recordings: webapp.gif, test_review.gif).