Introduction

In this project, we create a tweet generator by fine-tuning a pre-trained transformer on a user's tweets using HuggingFace Transformers, a popular library providing pre-trained architectures and utilities for NLP.

We also use the Weights & Biases integration to automatically log model performance and predictions.

Try it yourself →

Disclaimer: this project is not meant for publishing false generated information, but for performing research on Natural Language Generation.

Sample Predictions

Let's look at the predictions our model makes.

I was impressed by the quality of the results given so little data: the model learns @ mentions, hashtags and even emojis while producing sentences that seem to capture the Tweeter's mind!
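For reference, here is a minimal sketch of how such predictions can be generated from a fine-tuned checkpoint with the Transformers text-generation pipeline (the model path is a placeholder for wherever the fine-tuned weights are saved):

from transformers import pipeline

# Load the fine-tuned model from its output directory (placeholder path)
generator = pipeline("text-generation", model="output/gpt2-finetuned")

# Prompt with the document separator so generation starts a "fresh" tweet
samples = generator("<|endoftext|>", max_length=60, do_sample=True, top_p=0.95,
                    num_return_sequences=3)
for sample in samples:
    print(sample["generated_text"])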


General Overview

The model uses the following pipeline.


Overall, the quality of our results (vs a pre-trained model) is due to:

Building the dataset

Downloading tweets

We first gather tweets from a specific user through the Twitter API, which lets us download "only" the last 3,200 tweets. This includes retweets, very short tweets, etc., which we don't keep. The result is only 100 kB to 300 kB of data, several orders of magnitude smaller than typical NLP datasets!

We use Tweepy, which provides a nice interface to the Twitter API and lets us download tweets with simple commands such as:

import tweepy

# Authenticate and fetch the user's most recent tweets
auth = tweepy.AppAuthHandler(consumer_key, consumer_secret)
api = tweepy.API(auth)
tweets = api.user_timeline(screen_name=screen_name)
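The filtering described above can be sketched as follows (the helper function and the length threshold are illustrative, not the exact rules used):

def keep(tweet):
    # Drop retweets (they carry a retweeted_status attribute)
    if hasattr(tweet, "retweeted_status") or tweet.text.startswith("RT @"):
        return False
    # Drop very short tweets (threshold is illustrative)
    return len(tweet.text) > 20

curated = [t.text for t in tweets if keep(t)]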

Optimizing the dataset

The largest improvements we got in the predictions were due to careful exploration and pre-processing of the data.

Initially, we only performed basic clean-up:

Since our dataset is so small, we want to make sure we don't waste our neurons learning information we don't care about:

The model is going to try to predict the next token (GPT-2 uses a byte-level vocabulary) from a start sequence.

In order to feed the model tweets and have them treated "independently" (in reality several are read in a single batch), we separate tweets with the special token "<|endoftext|>", which OpenAI used during pre-training to separate documents. Our dataset then becomes something like:

<|endoftext|>This is my first tweet!<|endoftext|>Second tweet already!<|endoftext|>

Note: having no space around <|endoftext|> empirically leads to better predictions.

We shuffle the tweets at each epoch so that the model does not learn correlations between tweets where there should not be any.
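A minimal sketch of this pre-processing, assuming curated is the list of cleaned tweet texts from the earlier sketch (the file name is a placeholder):

import random

EOS = "<|endoftext|>"

# Shuffle at each epoch so the model does not learn spurious ordering
random.shuffle(curated)

# Concatenate tweets with the separator and no surrounding spaces
with open("train.txt", "w", encoding="utf-8") as f:
    f.write(EOS + EOS.join(curated) + EOS)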

We tried adding a special <tweet> token, but it did not improve results, likely due to the small size of the dataset.

Initial experiments

We use a pre-trained GPT-2 model (the "small" variant) and fine-tune it for multiple epochs on the tweets using the HuggingFace Transformers library.

This library contains nice scripts for fine-tuning models (run_language_modeling.py) and generating text (run_generation.py), which are ideal for the prototyping phase.

Experiments on different people show that we are more likely to over-fit after 4 epochs. For these experiments, we split the tweets between a training set (80% of the data) and a validation set (the remaining 20%).
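For reference, the same fine-tuning can be sketched directly with the Trainer API (a hedged equivalent of the script above; argument names vary a little across Transformers versions, and the file paths are placeholders):

from transformers import (GPT2LMHeadModel, GPT2Tokenizer, TextDataset,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Datasets built from the concatenated tweet files (80/20 split)
train_dataset = TextDataset(tokenizer=tokenizer, file_path="train.txt", block_size=128)
eval_dataset = TextDataset(tokenizer=tokenizer, file_path="valid.txt", block_size=128)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="output/gpt2-finetuned",
    overwrite_output_dir=True,
    num_train_epochs=4,          # more epochs tended to over-fit
    per_device_train_batch_size=1,
)

trainer = Trainer(model=model, args=args, data_collator=collator,
                  train_dataset=train_dataset, eval_dataset=eval_dataset)
trainer.train()
trainer.evaluate()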

Training the model

Comparing Losses Between Users

It is interesting to compare losses between users:


Fine-Tuning the Model

After doing a few initial experiments, we observe that the generated predictions can get much better with some tuning:

We run sweeps and observe how our hyper-parameters affect the validation loss on different Twitter users.
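A sweep configuration can be expressed with the W&B Python API roughly as follows (the parameters, ranges and project name are illustrative, and train_one_run stands for a function wrapping the fine-tuning above):

import wandb

sweep_config = {
    "method": "random",
    "metric": {"name": "eval_loss", "goal": "minimize"},
    "parameters": {
        "learning_rate": {"min": 1e-5, "max": 5e-4},
        "num_train_epochs": {"values": [2, 3, 4, 5]},
        "per_device_train_batch_size": {"values": [1, 2, 4]},
    },
}

sweep_id = wandb.sweep(sweep_config, project="tweet-generator")
wandb.agent(sweep_id, function=train_one_run)  # train_one_run is a placeholder training function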


The parameter importance table lets us see which parameters matter most (by running a random forest of input parameters against validation loss).

We make sure to ignore the Twitter handle in the parameter importance table, as we want to create a model that works on any account. Otherwise it would simply tell us to avoid @karpathy, which increases the validation loss a lot.
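To illustrate the idea behind that table (not W&B's actual implementation), a random forest can be fit on the sweep hyper-parameters against validation loss and its feature importances inspected:

import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# One row per sweep run, with hyper-parameters and final validation loss (placeholder export)
runs = pd.read_csv("sweep_runs.csv")
X = runs[["learning_rate", "num_train_epochs", "per_device_train_batch_size"]]
y = runs["eval_loss"]

forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
for name, importance in zip(X.columns, forest.feature_importances_):
    print(f"{name}: {importance:.2f}")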


We use the parallel coordinates graph and the parameter importance table to iteratively refine our parameters with the following objectives:

The initial sweep lets us reduce the parameter intervals to 5% of the initial input space for a second sweep.

Unfortunately, the best set of hyper-parameters depends on the Twitter handle, as we can see in the graph below.


In order to get consistent results across diverse Twitter handles, we run each set of hyper-parameters on 15 different Twitter handles and consider the mean validation loss plus one standard deviation (which encourages more robust results).
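Concretely, each set of hyper-parameters is scored roughly as follows (a sketch; validation_losses_per_handle stands for the 15 validation losses, one per handle):

import numpy as np

losses = np.array(validation_losses_per_handle)  # placeholder: 15 values, one per Twitter handle
score = losses.mean() + losses.std()             # penalizes configurations with inconsistent results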


Due to the small size of the datasets (and the random shuffling), our results can be noisy. We slowly refine our intervals to ensure we get consistently good scores.

We end up choosing:

Future Research

I still have more research to do:

About

Built by Boris Dayma


My main goals with this project are:

For more details, visit the project repository.


Resources

Got questions about W&B?

If you have any questions about using W&B to track your model performance and predictions, please reach out to the Slack community.