
Hugging Tweets: Train a Model to Generate Tweets

In this article, we demonstrate how to fine-tune a pre-trained HuggingFace Transformer on anyone's Tweets in five minutes
In this article, we create a tweet generator by fine-tuning a pre-trained transformer on a user's tweets using HuggingFace Transformers, a popular library that provides pre-trained NLP architectures and training utilities.
We also use the Weights & Biases integration to automatically log model performance and predictions.

Disclaimer: This project is not meant to be used to publish falsely generated information; it is meant for research on Natural Language Generation.

Let's get started.

Sample Predictions

Let's look at the predictions our model makes.
I was impressed by the quality of the results with so little data: the model learns @-mentions, hashtags, and even emojis, while producing sentences that seem to capture the tweeter's mind!

[W&B panel: sample predictions (run set of 14 runs)]




General Overview

The model uses the following pipeline: download a user's tweets, curate them into a dataset, and fine-tune a pre-trained transformer on the result.

Overall, the quality of our results (vs a pre-trained model) is due to:
  • 85% → curating the dataset (requires extra thought and manual exploration)
  • 15% → fine-tuning (easy and automatic, I just used W&B sweeps)

Building the Dataset

Let's take a look at how to build the dataset:

Downloading Tweets

We first gather tweets of a specific user through the Twitter API, which lets us download "only" the last 3,200 tweets. This includes retweets, very short tweets, etc., which we don't keep. What remains corresponds to only 100 kB to 300 kB of data, several orders of magnitude smaller than typical NLP datasets!
We use Tweepy, which provides a nice interface to the Twitter API and lets us download tweets with simple commands such as:
import tweepy

# Application-only authentication, then fetch the user's most recent tweets
auth = tweepy.AppAuthHandler(consumer_key, consumer_secret)
api = tweepy.API(auth)
tweets = api.user_timeline(screen_name)
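Note that user_timeline only returns a single page of tweets by default. To get closer to the 3,200-tweet limit, pagination is needed; here is a minimal sketch using Tweepy's Cursor (the count, tweet_mode, and item limit values are illustrative, not necessarily the exact ones used in the project):

# Paginate through the timeline to collect up to ~3,200 raw tweet texts
all_tweets = [
    status.full_text
    for status in tweepy.Cursor(
        api.user_timeline,
        screen_name=screen_name,
        count=200,              # maximum tweets per API page
        tweet_mode="extended",  # avoid truncated tweet text
    ).items(3200)
]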

Optimizing the Dataset

The largest improvements we got in the predictions were due to careful exploration and pre-processing of the data.
Initially, we only performed basic clean-up:
  • remove all retweets (since they are not from the user we try to learn from)
  • fix HTML-encoded characters such as &amp; (&), &gt; (>), and &lt; (<)
Since our dataset is so small, we want to make sure we don't waste our neurons learning information we don't care about:
  • we remove URLs and picture links
  • we remove short or "boring" tweets such as "Thank you", "Cool", etc.
  • we remove extra spaces, which for some reason are often present in tweets (this unexpectedly leads to much better results); see the clean-up sketch below
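In code, this kind of clean-up might look roughly like the sketch below. The exact rules used in the project may differ; the regular expressions, the minimum length, and the list of "boring" tweets are illustrative assumptions, and raw_tweets is assumed to be the list of tweet texts collected earlier.

import html
import re

BORING_TWEETS = {"thank you", "thanks", "cool", "nice"}  # hypothetical examples

def clean_tweet(text):
    text = html.unescape(text)                # fix &amp;, &gt;, &lt;, ...
    text = re.sub(r"https?://\S+", "", text)  # remove URLs and picture links
    return re.sub(r"\s+", " ", text).strip()  # remove extra spaces

cleaned_tweets = []
for raw in raw_tweets:
    if raw.startswith("RT @"):                # skip retweets
        continue
    cleaned = clean_tweet(raw)
    if len(cleaned) < 10 or cleaned.lower() in BORING_TWEETS:  # skip short/boring tweets
        continue
    cleaned_tweets.append(cleaned)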
The model is going to try to predict the next characters (actually the next byte-level BPE tokens) from a start sequence.
In order to feed the model tweets and have them treated "independently" (in reality several are read in a single batch), we separate tweets with the special token "<|endoftext|>", which OpenAI used during pre-training to separate documents. Our dataset then becomes something like this:
<|endoftext|>This is my first tweet!<|endoftext|>Second tweet already!<|endoftext|>
Note: having no space around <|endoftext|> empirically leads to better predictions.
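Building the training file from the cleaned tweets is then a one-liner; a minimal sketch (the file name is just an example):

# Join the cleaned tweets with the separator token, with no surrounding spaces
EOT = "<|endoftext|>"
with open("train.txt", "w", encoding="utf-8") as f:
    f.write(EOT + EOT.join(cleaned_tweets) + EOT)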
We shuffle the tweets at each epoch so that the model does not learn correlations between consecutive tweets that should not exist.
We tried adding a special <tweet> token but it did not help get better results, due to the small size of the dataset.

Initial Experiments

We use a pre-trained GPT-2 model (the "small" variant) and fine-tune it for multiple epochs on the tweets using the HuggingFace Transformers library.
This library contains nice scripts for fine-tuning models (run_language_modeling.py) or generating text (run_generation.py) which are ideal for the prototyping phase.
Experiments on different people show that the model becomes more likely to over-fit after 4 epochs. For this experiment, we actually split tweets between a training set (80% of the data) and a validation set (the remaining 20%).
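For reference, here is a rough Python equivalent of what run_language_modeling.py does, written with the Trainer API. This is only a sketch under assumptions: the file names and block size are illustrative, and TextDataset has since been deprecated in newer versions of transformers.

from transformers import (
    GPT2LMHeadModel,
    GPT2TokenizerFast,
    TextDataset,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Language-modeling datasets built from the <|endoftext|>-separated text files
train_dataset = TextDataset(tokenizer=tokenizer, file_path="train.txt", block_size=128)
eval_dataset = TextDataset(tokenizer=tokenizer, file_path="valid.txt", block_size=128)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)  # causal LM

args = TrainingArguments(
    output_dir="output",
    num_train_epochs=4,      # over-fitting tends to appear beyond 4 epochs
    learning_rate=1.37e-4,   # value chosen from the sweeps later in this report
)

trainer = Trainer(
    model=model,
    args=args,
    data_collator=collator,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)
trainer.train()
print(trainer.evaluate())    # reports eval_loss on the validation set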

[W&B panel: run set of 3 runs]


Comparing Losses Between Users

It is interesting to compare losses between users:
  • some runs are longer for certain people, as they have more data (so more batches per epoch), either due to a larger number of tweets (vs retweets) or a longer average tweet length (Jack Clark and François Chollet seem to be the most talkative here).
  • Andrej Karpathy's loss is much higher than everybody else's (I verified it multiple times on independent runs), meaning he's the most unpredictable to the model.


[W&B panel: loss comparison between users (run set of 8 runs)]


Fine-Tuning the Model

After doing a few initial experiments, we observe that the generated predictions can get much better with some tuning:
  • Better clean-up of the data (extra spaces, position of special tokens, etc.)
  • Shuffling tweets between each epoch (so that batches don't contain the same succession of tweets)
  • Varying the learning rate scheduler, number of epochs, learning rate, etc.
We run sweeps and observe how our hyper-parameters affect the validation loss on different Twitter users.
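A sweep is defined by a configuration passed to W&B; a minimal sketch of what it might look like is below. The parameter names and ranges are illustrative assumptions, and train is assumed to be a function that fine-tunes the model with the sampled hyper-parameters and logs eval_loss.

import wandb

sweep_config = {
    "method": "random",
    "metric": {"name": "eval_loss", "goal": "minimize"},
    "parameters": {
        "learning_rate": {"min": 1e-5, "max": 5e-4},
        "num_train_epochs": {"values": [2, 3, 4, 5, 6]},
        "lr_scheduler": {"values": ["linear", "cosine", "constant"]},
        "gradient_accumulation_steps": {"values": [1, 2, 4]},
        "warmup_steps": {"values": [0, 100, 500]},
    },
}

sweep_id = wandb.sweep(sweep_config, project="huggingtweets")
wandb.agent(sweep_id, function=train)  # each agent run calls train() with sampled values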

[W&B panel: hyper-parameter sweep (run set of 260 runs)]

The parameter importance table lets us see which parameters are the most important (by running a random forest on input parameters vs validation loss).
We make sure to ignore the Twitter handle in the parameter importance table, since we want to create a model that works on any account. Otherwise, it would tell us to avoid using @karpathy, which significantly increases the validation loss.

[W&B panel: parameter importance (run set of 260 runs)]

We use the parallel coordinates graph and the parameter importance table to iteratively refine our parameters with the following objectives:
  • we need a set of hyper-parameters that work well independently of the Twitter handle chosen ;
  • we want as few epochs as possible to have faster training (users are typically impatient).
The initial sweep lets us reduce the parameter intervals to about 5% of the initial search space for a second sweep.
Unfortunately, the best set of hyper-parameters depends on the Twitter handle as we can see in the graph below:


[W&B panel: best hyper-parameters per Twitter handle (run set of 8 runs)]

In order to get consistent results across diverse Twitter handles, we run each set of hyper-parameters on 15 different Twitter handles and consider the mean validation loss + 1 standard deviation (which encourages more robust results).
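Concretely, the robust score for one hyper-parameter configuration could be computed as in the short sketch below (the loss values are made up for illustration):

import statistics

# One validation loss per Twitter handle for a given configuration (illustrative values)
eval_losses = [3.1, 2.8, 3.4, 2.9, 3.0]

# Mean + 1 standard deviation penalizes configurations that only work well on some handles
robust_score = statistics.mean(eval_losses) + statistics.stdev(eval_losses)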



[W&B panel: run set of 83 runs]

Due to the size of the datasets (and random shuffling), our results can be noisy. We slowly refine our interval to ensure we get consistently good scores.
We end up choosing:
  • cosine learning rate scheduler
  • no gradient accumulation
  • no warmup
  • 4 epochs
  • learning rate of 1.37e-4
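With these hyper-parameters, the fine-tuned model can generate new tweets from a prompt, which is what run_generation.py automates. A minimal sketch is below; the prompt, sampling parameters, and output directory are illustrative assumptions.

from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("output")  # fine-tuned model directory

# Start from the document separator so the model produces a fresh "tweet"
input_ids = tokenizer.encode("<|endoftext|>My dream is", return_tensors="pt")
outputs = model.generate(
    input_ids,
    do_sample=True,        # sample instead of greedy decoding
    max_length=60,
    top_p=0.95,            # nucleus sampling
    num_return_sequences=3,
    pad_token_id=tokenizer.eos_token_id,
)
for output in outputs:
    print(tokenizer.decode(output, skip_special_tokens=True))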

Future Research

I still have more research to do:
  • Evaluate how to "merge" two different personalities ;
  • Test training top layers vs bottom layers to see how it affects learning of lexical field (subject of the content) vs word predictions, memorization vs creativity ;
  • Augment text data with adversarial approaches ;
  • Pre-train on a large Twitter dataset of many people ;
  • Explore few-shot learning approaches as we have limited data per user though there are probably only a few writing styles ;
  • Implement a pipeline to continuously train the network on new tweets ;
  • Cluster users and identify topics, writing style…

About

Built by Boris Dayma

My main goals with this project are:
  • to experiment with how to train, deploy and maintain neural networks in production ;
  • to make AI accessible to everyone ;
  • to have fun!
For more details, visit the project repository.

