Automating Change Log Tweets with Few-Shot Learning and GPT-3
A text summarization recipe using OpenAI GPT-3's few-shot learning and Weights & Biases
Created on September 22 | Last edited on October 6
Introduction
One of these tweets was generated by GPT-3. Can you guess which one?

Trick question. If you guessed any one of them, you were right.
All of these tweets were generated by the OpenAI GPT-3 text completion models. Developer tweets such as these are often used to inform users about new features and bug fixes. In addition to improving user engagement, tweets like these help create and nurture a community of support and feedback that is valuable for any software's growth and success. Nevertheless, this is a mundane task that can benefit from some automation.
This guide documents and presents a few recipes using OpenAI's GPT-3 and Weights & Biases to summarize release notes from a repository as a tweet.
Data Collection
Of course, we first need some data to achieve our goal. The data must contain pairs of release notes and their corresponding tweets. As a starting point, we collect a few tweets about our software releases from Twitter. Specifically, we collect tweets by @weights_biases that mention the phrase "pip install wandb". We use snscrape to fetch the tweets programmatically as follows:
```shell
$ snscrape --jsonl --progress --max-results 100 --since 2019-06-01 \
    twitter-search "pip install wandb from:weights_biases" > raw_tweets.jsonl
Finished, 19 results
```
We were able to gather 19 tweets and store them in a JSON-lines file named raw_tweets.jsonl. For repeatability, we'll create and upload a wandb artifact. To do this, we run the following:
```python
raw_artifacts = wandb.Artifact("raw_dataset", type="dataset")
raw_artifacts.add_file(raw_tweets_file, name="scraped_tweets")
```
A sample record from the scraped tweets contains a lot of metadata related to a tweet. We are mainly interested in the following fields: url (the tweet URL), date, and content.
```json
{
  ...<truncated>,
  "url": "https://twitter.com/weights_biases/status/1568250742363783168",
  "date": "2022-09-09T14:51:28+00:00",
  "content": "💥 wandb 0.13.3 released 💥\n7️⃣ new contributors!\n\n💫Improves performance when sending large messages\n💫Add better error handling when tearing down service\n💫Adds optional timeout to artifacts wait()\n🐛Fixes hang from keyboard interrupt on windows\n\n👉`pip install wandb --upgrade`",
  ...<truncated>
}
```
Notice that the tweet text contains the release version, known as a semver; in the above example, the semver is 0.13.3. This version can be used to pull the corresponding release notes from GitHub. We enhance the scraped data by extracting this version information from the text, validating the extracted version, and saving the dataset as a wandb.Table in our previously created raw_artifacts. The following two sections show the code we used for these tasks:
Code to extract the semver
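The extraction snippet is collapsed in the original report, so here is a minimal sketch as a stand-in. It assumes a simple regex plus validation with the packaging library; both are our choices, not necessarily the original implementation.

```python
import re
from packaging.version import Version, InvalidVersion

# Match the first X.Y.Z-style version that appears in the tweet text
SEMVER_RE = re.compile(r"\b(\d+\.\d+\.\d+)\b")

def extract_semver(tweet_text):
    match = SEMVER_RE.search(tweet_text)
    if match is None:
        return None
    try:
        # Version() raises InvalidVersion for malformed version strings
        return str(Version(match.group(1)))
    except InvalidVersion:
        return None

df["semver"] = df["content"].map(extract_semver)
df = df[~df["semver"].isna()].reset_index(drop=True)
```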
Code to collect the release notes
```python
import os
from github import Github

GH_USER = "wandb"
GH_REPO = "wandb"
github_client = Github(os.environ["GH_TOKEN"])

def load_from_tag(tag):
    # Fetch the GitHub release whose tag matches the extracted semver
    try:
        release_data = (
            github_client.get_user(GH_USER)
            .get_repo(GH_REPO)
            .get_release(f"v{tag}")
        )
        if release_data.body.strip():
            release_notes = release_data.body
            return release_notes
    except Exception:
        return None

def load_release_notes(df):
    df["release_notes"] = df["semver"].map(load_from_tag)
    df = df[~df["release_notes"].isna()]
    df = df.reset_index(drop=True)
    return df

df = load_release_notes(df)
changelog_tweets_table = wandb.Table(dataframe=df)
raw_artifacts.add(changelog_tweets_table, name="changelog_tweets")
wandb.log_artifact(raw_artifacts)
wandb.finish()
```
Finally, for easier visualization of the tweets and the release notes, we convert the markdown-formatted text into wandb.Html objects and log the dataset to our raw_artifacts. Here's a subset of columns from the formatted dataset.
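The exact conversion helper isn't shown in the report; as a rough sketch, the markdown-to-wandb.Html step could look like this (the `_html` column names are our own):

```python
import markdown

def to_html(md_text):
    # Render markdown as HTML so the text displays nicely inside wandb Tables
    return wandb.Html(markdown.markdown(md_text))

df["release_notes_html"] = df["release_notes"].map(to_html)
df["tweet_html"] = df["content"].map(to_html)
```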
As we can see, each release note contains formatted information about changes, enhancements, bug fixes, and new contributors, while each tweet summarizes the corresponding release using the following template:
```
wandb <version> released
<release summary>
...
...
pip install wandb --upgrade
```
Data Preparation
Since we will be using the OpenAI GPT-3 models, we need to preprocess and format our data for the summarization task. Specifically, we remove redundant information from the change log, reduce the number of tokens in each text, and normalize the text without changing its meaning. Finally, we format the text into prompts for the model.
Data Cleaning
We clean up the release notes to remove unwanted text, i.e., text not useful for summarization, and strip the template boilerplate from the tweets. In addition to reducing the complexity of the inputs, this also helps reduce the number of tokens we feed into the model.
At this stage, we reuse the raw_artifacts created during the data collection stage. The following snippet of code demonstrates how to do this:
```python
import wandb
import pandas as pd

run = wandb.init(project=PROJECT_NAME, name="data_processing")
raw_artifacts = wandb.use_artifact("raw_dataset:latest")
raw_artifacts = raw_artifacts.wait()
raw_changelog_tweets = raw_artifacts.get("changelog_tweets")
raw_dataset = pd.DataFrame(
    raw_changelog_tweets.data,
    columns=raw_changelog_tweets.columns,
)
```
Cleaning up release notes
- Convert the release notes from markdown to plain text.
- Remove all URLs from the release notes.
- Remove suffixes of the form by <user> in <PR> from the release notes text.
- Remove all dates from the release notes.
- Remove empty lines and the trailing "Full Changelog" line if present.
```python
import re
from bs4 import BeautifulSoup
import markdown
import emoji

USRBY_RE = r"by \@\S+ in https\S+$"  # by <user> in <pr>
URL_RE = r"in https\S+"              # in <pr>
CH_RE = "^full changelog.*"          # Full Changelog: ...
DATE_RE = r"\(\w+ \d+, \d+\)"        # (August 10, 2022)

def cleanup_changelog(md_text):
    # Render the markdown and strip the tags to get plain text
    html = markdown.markdown(md_text)
    soup = BeautifulSoup(html, features="html.parser")
    text = re.sub(USRBY_RE, "", soup.text, 0, re.MULTILINE)
    text = re.sub(URL_RE, "", text, 0, re.MULTILINE)
    text = re.sub(CH_RE, "", text, 0, re.MULTILINE | re.IGNORECASE)
    text = re.sub(DATE_RE, "", text, 0, re.MULTILINE | re.IGNORECASE)
    text = text.strip()
    text = map(lambda x: x.strip(), text.splitlines())
    text = "\n".join(text)
    # Normalize emoji aliases into a consistent demojized form
    text = emoji.demojize(emoji.emojize(text, language="alias"))
    # Unescape the HTML entity left over from the markdown rendering
    text = text.replace('&amp;', '&')
    return text
```
Cleaning up tweets
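The tweet cleanup code is also collapsed in the original report. Below is a minimal sketch of a cleanup_tweet helper, assuming the goal is simply to strip the fixed template lines (the "wandb <version> released" header and the pip install footer) seen earlier; the regexes are our own, not necessarily the original implementation.

```python
import re

# Assumed patterns for the template header and footer shown above
PIP_RE = re.compile(r"^.*pip install wandb.*$", re.IGNORECASE | re.MULTILINE)
HEADER_RE = re.compile(r"^.*wandb \d+\.\d+\.\d+ released.*$", re.IGNORECASE | re.MULTILINE)

def cleanup_tweet(text):
    # Drop the template header and footer, keep only the summary body
    text = PIP_RE.sub("", text)
    text = HEADER_RE.sub("", text)
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return "\n".join(lines)
```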
Finally, we clean up the dataset to remove any pairs where the change log is shorter than the tweet, and store it as a wandb.Table in a new artifact called processed_dataset.
```python
def cleanup_dataset(df):
    # clean up the changelogs
    df["cleaned_logs"] = df["release_notes"].map(cleanup_changelog)
    cleaned_df = df[df["cleaned_logs"].str.split().map(len) > 50]
    # clean up the tweets
    cleaned_df["cleaned_tweet"] = cleaned_df["tweet"].map(cleanup_tweet)
    cleaned_df = cleaned_df.reset_index(drop=True)
    return cleaned_df

cleaned_dataset = cleanup_dataset(raw_dataset)
processed_artifacts = wandb.Artifact("processed_dataset", type="dataset")
processed_artifacts.add(wandb.Table(dataframe=cleaned_dataset), name="cleaned_data")
```
Data Formatting
The GPT-3 models are capable of performing a wide variety of text completion tasks. Given the small amount of data we were able to collect, we pose summarization as a text completion task. The API documentation also provides some useful pointers on prompt design for text summarization. We incorporate these tips in designing prompts for three different text completion settings.
Zero-shot Prompt
In this setting, the prompt is created by formatting the text as follows:
```
Summarize as a tweet:

[changelog]:

<the cleaned up changelog text>

[tweet]:
```
The following code block shows how we do this programmatically and save the output into our processed_dataset artifact:
```python
def make_zero_shot_prompt(row, add_start=True):
    start_seq = "summarize as a tweet:\n\n"
    log = row["cleaned_logs"].strip()
    prompt = "[changelog]:\n\n" + log + "\n\n[tweet]:\n\n"
    if add_start:
        prompt = start_seq + prompt
    return prompt

cleaned_dataset["prompt"] = cleaned_dataset.apply(make_zero_shot_prompt, axis=1)
processed_artifacts.add(wandb.Table(dataframe=cleaned_dataset), name="zeroshot_dataset")
```
One-shot Prompt
In this case, we present a single example to the model in the prompt. The format looks as follows:
```
Summarize as a tweet:

[changelog]:

<the cleaned up changelog text>

[tweet]:

<the cleaned up tweet text>

###

[changelog]:

<the cleaned up changelog text>

[tweet]:
```
Here's the code to do this for our dataset:
```python
def make_one_shot_prompt(row):
    start_seq = "summarize as a tweet:\n\n"
    log = row["cleaned_logs"].strip()
    tweet = row["cleaned_tweet"]
    prompt = (
        start_seq
        + "[changelog]:\n\n" + log
        + "\n\n[tweet]:\n\n" + tweet
        + "\n\n###\n\n"
    )
    return prompt

def make_one_shot_dataset(dataset):
    records = []
    prompts = []
    for idx in range(1, len(dataset)):
        current_row = dataset.iloc[idx]
        previous_row = dataset.iloc[idx - 1]
        first_half = make_one_shot_prompt(previous_row)
        second_half = make_zero_shot_prompt(current_row, add_start=False)
        prompt = first_half + second_half
        prompts.append(prompt)
        records.append(current_row)
    data = pd.DataFrame(records)
    data["prompt"] = prompts
    return data

one_shot_dataset = make_one_shot_dataset(cleaned_dataset)
processed_artifacts.add(wandb.Table(dataframe=one_shot_dataset), name="oneshot_dataset")
wandb.log_artifact(processed_artifacts)
```
Few-Shot Prompt
In this setting, we concatenate all historic change logs and tweets into a prompt similar to the one-shot setting. However, we do this dynamically at inference time, treating the prompt_length and the ordering of each change log + tweet pair in the prompt as hyperparameters. We also need to make sure that the prompts do not exceed the model's maximum sequence length. The following code snippet shows how this is done:
```python
def make_history(df, prompt_length=2000, shuffled=False):
    if shuffled:
        df = df.sample(frac=1, replace=False)
    total_length = 0
    prompts = []
    # Walk backwards through history, adding examples until the budget is spent.
    # `prompt_length` and `log_length` are per-row length columns computed upstream.
    for idx, item in df.iloc[::-1].iterrows():
        if item["prompt_length"] + total_length < prompt_length:
            total_length += item["prompt_length"]
            row_prompt = (
                "[changelog]:\n\n"
                + item["cleaned_logs"].strip()
                + "\n\n[tweet]:\n\n"
                + item["cleaned_tweet"].strip()
                + "\n\n###\n\n"
            )
            prompts.append(row_prompt)
    return "\n".join(prompts[::-1])

def make_few_shot_dataset(df, prompt_length, shuffled):
    prompts = []
    tweets = []
    for idx in range(1, len(df)):
        current_row = df.iloc[idx]
        previous_rows = df.iloc[:idx]
        history_length = prompt_length - current_row["log_length"] - 10
        history_prompt = make_history(previous_rows, history_length, shuffled=shuffled)
        prompt = (
            history_prompt
            + "[changelog]:\n\n"
            + current_row["cleaned_logs"].strip()
            + "\n\n[tweet]:\n\n"
        )
        prompts.append(prompt)
        tweets.append(current_row["cleaned_tweet"].strip())
    return pd.DataFrame(
        {"prompt": prompts, "reference": tweets}, index=range(1, len(df))
    )
```
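For example, building a few-shot dataset with a prompt budget of 3,000 and shuffled history (illustrative values, not the sweep's chosen ones) looks like:

```python
# Illustrative hyperparameter values; the sweeps below search over these
few_shot_dataset = make_few_shot_dataset(cleaned_dataset, prompt_length=3000, shuffled=True)
```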
Summarization Sweeps
In addition to designing good prompts, we also need to define the parameters passed to the OpenAI GPT-3 API completion endpoint. These include the temperature and top_p settings, which control how deterministic the model is in generating a response; the frequency and presence penalties, which control the likelihood of sampling repetitive sequences of tokens in an output; and model, which specifies the GPT-3 model to use.
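As a minimal sketch, a single completion call with the legacy (pre-1.0) OpenAI Python SDK, which was current when this report was written, looks like the following; the parameter values are placeholders, and the stop sequence is our assumption:

```python
import openai  # legacy (pre-1.0) SDK: openai.Completion is the completion endpoint

response = openai.Completion.create(
    model="text-davinci-002",
    prompt=prompt,              # one of the formatted prompts from above
    max_tokens=80,
    temperature=0.7,
    top_p=0.5,
    frequency_penalty=0.3,
    presence_penalty=0.3,
    n=5,                        # five completions per prompt, as in the sweeps below
    stop="[changelog]:",        # assumption: stop before the model starts a new example
)
generated_tweets = [choice.text.strip() for choice in response.choices]
```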
We make use of wandb Sweeps to find optimal values for these parameters in the three different settings. However, to run sweeps we need an automated evaluation metric. Since summarization is a rather subjective task, we use a proxy metric that combines the following:
- bleu: the number of n-grams in the generated tweet that appear in the reference tweet.
- rouge: the number of n-grams in the reference tweet that appear in the generated tweet.
- bertscore: a semantic similarity score between the generated and reference tweets.
We use the evaluate package to programmatically compute a mean of the above three scores.
Evaluation Code
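The evaluation code is collapsed in the original report; below is a hedged sketch of how such a combined score could be computed with the evaluate package. The metric names are real, but the choice of rouge1 and the simple unweighted mean are our reading of the description above.

```python
import evaluate
import numpy as np

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

def mean_score(predictions, references):
    # bleu and rouge measure n-gram overlap; bertscore measures embedding similarity
    b = bleu.compute(predictions=predictions, references=references)["bleu"]
    r = rouge.compute(predictions=predictions, references=references)["rouge1"]
    s = float(np.mean(bertscore.compute(
        predictions=predictions, references=references, lang="en")["f1"]))
    return float(np.mean([b, r, s]))
```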
Zero-shot Sweep
In this sweep, we use the zero-shot prompts we made previously to identify a good set of parameters for creating tweets in a setting where no historic tweets are available. Since the API is stochastic in nature, we generate 5 completions for each prompt and compute the mean score over the 5 completions.
Sweep Configuration
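The sweep configuration is collapsed in the original report. A hypothetical configuration covering the parameters discussed above (the search method, ranges, and model list are our assumptions) might look like:

```python
sweep_config = {
    "method": "bayes",
    "metric": {"name": "mean_score", "goal": "maximize"},
    "parameters": {
        "model": {"values": [
            "text-ada-001", "text-babbage-001", "text-curie-001", "text-davinci-002",
        ]},
        "temperature": {"min": 0.0, "max": 1.0},
        "top_p": {"min": 0.0, "max": 1.0},
        "frequency_penalty": {"min": 0.0, "max": 1.0},
        "presence_penalty": {"min": 0.0, "max": 1.0},
    },
}
sweep_id = wandb.sweep(sweep_config, project=PROJECT_NAME)
# wandb.agent(sweep_id, function=run_sweep) would then launch the sweep runs,
# where run_sweep is a hypothetical function that generates and scores tweets
```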
Sweep Results
Example Generated Tweet
for v0.12.2
```
Bug Fix: Tensorflow/Keras 2.6 not logging validation examples.
Fix metrics logged through tensorboard not supporting time on x-axis.
Fix WANDB_IGNORE_GLOBS environment variable handling.
Fix handling when sys.stdout is configured to a custom logger.
Fix sklearn feature importance plots not matching feature names properly
```
Key Insights
- The overall task metric value is low here. This is understandable since the model has no idea what kind of tweets we expect.
- The best parameter settings based on the above sweep are as follows:
```
model="text-curie-001"
max_tokens=80
frequency_penalty=0.71
presence_penalty=0.46
temperature=0.95
top_p=0.38
```
One-shot Sweep
In this sweep, we provide the model with a single example of a previous changelog and tweet in addition to the prompt. Here again, we generate 5 completions for each prompt and compute the mean score over them. The sweep configuration is similar to the zero-shot setting.
Sweep Results
Example Generated Tweet
for v0.12.21
```
0.12.21 is out with bug fixes and new features!
4️⃣ new contributors!
💫 Fixes config not showing up until the run finish
💫 Adds new types to the TypeRegistry to handling artifact objects in jobs and run configs
🐞 Fixes stats logger so it can find all the
```
Key Insights
- The mean score increases from 0.38 in the zero-shot setting to 0.48 with a single example in the prompt.
- The model begins to generate tweets similar to our tweeting style.
- The best parameters based on the above sweep are as follows:
```
model="text-davinci-002"
max_tokens=80
frequency_penalty=0.30
presence_penalty=0.29
temperature=0.76
top_p=0.54
```
Few-shot Sweep
In this sweep, we provide all previous tweets in the prompt. In addition to the parameters provided to the model, we also treat the prompt length as a hyperparameter in the search, as sketched below.
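Continuing the hypothetical sweep configuration from earlier, the few-shot setting could expose the prompt budget and example ordering like this (the value lists are illustrative; the actual collapsed config isn't shown in the report):

```python
# Hypothetical extension of the earlier sweep configuration
sweep_config["parameters"]["prompt_length"] = {"values": [1000, 2000, 3000]}
sweep_config["parameters"]["shuffled"] = {"values": [True, False]}
```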
Example Generated Tweet
for v0.13.3
```
7️⃣ new contributors!
💫 Adds the ability to log artifacts while linking to registered model
💫 Adds parallelism to functional testing
💫 Adds support for arguments in Launch Jobs
🐞 Fixes issue triggered by colab update by using default file and catching exceptions
```
Key Insights
- Few shot prompts improve the quality of tweets.
- The larger text-davinci-002 model often has the best performance.
- The parameters resulting in the best generations are as follows:
```
model="text-davinci-002"
max_tokens=80
frequency_penalty=0.60
presence_penalty=0.82
temperature=0.34
top_p=0.42
prompt_length=3000
```
Conclusion
In this report, we saw how to use OpenAI's GPT-3 to automate the routine task of tweeting about software release notes. We created a small dataset and visualized it using wandb Tables. We also evaluated the text summarization task and used wandb Sweeps to find an appropriate set of generation hyperparameters. The full code for this report and automation recipe is documented in this GitHub repo.
If you end up creating something similar, drop it in the comments! We'd love to check it out. Thanks for reading.