Tables Tutorial: Visualize Text Data & Predictions

A guide on how to log and organize text data and language model predictions with our old friend William Shakespeare. Made by Stacey Svetlichnaya using W&B
W&B Tables—our latest feature for dataset and prediction visualization—isn't solely for computer vision projects. Tables also extends to natural language processing tasks, letting you dynamically explore the training data, predictions, and generated output of language models.
In this guide, I generate Shakespearean prose via a character-based RNN in PyTorch to illustrate common use cases in exploratory natural language generation. You'll learn how to:
View incremental training logs
Save and reload models seamlessly
Organize generated output
Explore Tables to compare model performance
Let's start with that first point:

View incremental training logs

This section will show you how to log and browse a model's predictions over time to get an overall sense of how performance improves during training.
Here I log the model's continuation of the input "Wh" every 100 epochs, alongside the loss. Over time, a higher proportion of the generated letter-sequences are actual dictionary words, character names in caps become more recognizably Shakespearean, and some of the phrases start to make a little bit of sense.
An early sample reads "When to the burther, thou not week, Of be the print is de-comer-sumerence speating." (So: not ideal.) A much later sample gets to some phrase-level coherence: "What advancements? WARWICK: I know you not lies it is kneel. Come, time the base."
You can scroll through sample text below, filter based on loss or epoch, or view the same Table in the Artifacts context.
Top: 10K epochs, sampled every 500. Bottom: 8K, sampled every 100.

Sample code to log text to a wandb.Table

There are two ways to save a Table to W&B: log it directly to a run with run.log(), or add it to an Artifact with artifact.add(). The quickest version is a single run.log() call:

run.log({"my_table_name" : wandb.Table(data=data, columns=cols)})

The run.log() path in context; the key additions are creating the Table, adding a row of generated text at each sampling step, and logging the Table at the end of training:
# create a wandb run and store experiment settings
# and hyperparameters in the config object
run = wandb.init(project="nlg", job_type="train")
wandb.config.update(...)

# [initialize models]

# create a Table with columns for the epoch index,
# the model loss, and the generated text
text_table = wandb.Table(columns=["epoch", "loss", "text"])

print("Training for %d epochs..." % args.n_epochs)
for epoch in tqdm(range(1, args.n_epochs + 1)):
    loss = train(*random_training_set(args.chunk_len, args.batch_size))
    loss_avg += loss
    # log loss to wandb
    wandb.log({"loss" : loss})

    if epoch % args.print_every == 0:
        print('[%s (%d %d%%) %.4f]' % (time_since(start), epoch, epoch / args.n_epochs * 100, loss))
        train_text = generate(decoder, 'Wh', 100, cuda=args.cuda)
        # log data row to Table
        text_table.add_data(epoch, loss, train_text)
        print(train_text, '\n')

# log the table to wandb
run.log({"training_samples" : text_table})

artifact.add(my_table, "my_table_name")

The slightly longer artifact.add() path is below. You can also see the code that created any particular Artifact: click the "Overview" tab for that Artifact, then the name of the run listed next to "Output by", then that run's code tab (bottom left icon, "{ }").
# create a wandb run and store experiment settings
# and hyperparameters in the config object
run = wandb.init(project="nlg", job_type="train")
wandb.config.update(...)

# [initialize models]

# create an Artifact to version the training predictions logged
at = wandb.Artifact("train_charnn", type="train_samples")

# create a Table with columns for the epoch index,
# the model loss, and the generated text
text_table = wandb.Table(columns=["epoch", "loss", "text"])

print("Training for %d epochs..." % args.n_epochs)
for epoch in tqdm(range(1, args.n_epochs + 1)):
    loss = train(*random_training_set(args.chunk_len, args.batch_size))
    loss_avg += loss
    # log loss to wandb
    wandb.log({"loss" : loss})

    if epoch % args.print_every == 0:
        print('[%s (%d %d%%) %.4f]' % (time_since(start), epoch, epoch / args.n_epochs * 100, loss))
        train_text = generate(decoder, 'Wh', 100, cuda=args.cuda)
        # log data row to Table
        text_table.add_data(epoch, loss, train_text)
        print(train_text, '\n')

# add the table to the Artifact and log it to wandb
at.add(text_table, "training_samples")
run.log_artifact(at)
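Once the Table lives in an Artifact, you can pull it back into a later run for analysis. A minimal sketch, assuming the artifact name, type, and table key used above:
# sketch: reload the Table from the Artifact in a later run
import wandb

run = wandb.init(project="nlg", job_type="analysis")
at = run.use_artifact("train_charnn:latest", type="train_samples")
# retrieve the Table object by the name it was added under
text_table = at.get("training_samples")
print(text_table.columns)
run.finish()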

Save and reload models seamlessly

Save trained models as artifacts

Next, we'll cover how to store models in wandb during training by saving the architecture & weights to a local file and logging this file as an artifact.
The file format and syntax to save a model vary by deep learning framework, but wandb is framework agnostic: you can add any directory or file format to an artifact.
def save(run, save_model_filename):
    # save model locally using torch
    torch.save(decoder, save_model_filename)
    # create an artifact to track this model
    model_at = wandb.Artifact(save_model_filename, type="charnn_model")
    # add the model file
    model_at.add_file(save_model_filename)
    # log artifact to wandb
    run.log_artifact(model_at)
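As noted above, the same pattern works for any file layout. For example, a whole directory of framework-specific checkpoint files can be versioned in one call; a hedged sketch, where "checkpoints/" is a hypothetical local directory:
# sketch: version an entire checkpoint directory instead of a single file;
# "checkpoints/" is a hypothetical local directory of framework-specific files
import wandb

run = wandb.init(project="nlg", job_type="train")
model_at = wandb.Artifact("base_shakespeare", type="charnn_model")
model_at.add_dir("checkpoints/")
run.log_artifact(model_at)
run.finish()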

Load trained models for inference

To reload the model, use the name of a previously saved model and specify the version, either numerically or with an alias. For example, python evaluate.py -m base_shakespeare.pt:v3 loads version 3 of the base_shakespeare.pt model artifact, while python evaluate.py -m base_shakespeare.pt:latest loads the most recent version logged to that artifact.
run = wandb.init(project="nlg", job_type="evaluate")model_at = run.use_artifact(args.model_filename, type='charnn_model')# acquire a local copy of the modelmodel_dir = model_at.download()# extract model filenamemodel_filename = os.listdir(model_dir)[0]# load model using torchdecoder = torch.load(os.path.join(model_dir, model_filename))

Organize generated output

Browse text samples

Now, we'll load a previously trained model, sample it (e.g. with a list of prompts), and organize the generated text in a wandb.Table.
In this example, I vary the temperature to adjust the randomness of the output. You can hover over a line to see the full text, scroll down in the panel, filter by temperature values, or paginate in the top right corner to see more examples. This model's continuity across line breaks is fairly weak, but short phrases and sometimes whole utterances sound realistic. At the very least, some of the made-up words are delightful: "Shall I compare thee to a man? Thou art a babbility" (punctuation mine).

Dynamically group by relevant fields

Group by different columns to get a sense of your data at a glance. Page through responses to different prompts for the same temperature value (T) to get a sense of how much randomness to add. For this particular model, T=0.1 tends to end generation quickly and T=0.2 is too boring, while values above 0.5 yield made-up words. Below, you can choose the pagination scheme (1, 2, 3, 5, or 10 sections per cell) to find patterns across prompts for a given temperature.
Group by prompt to scroll through the corresponding responses as temperature increases. The phrase coherence degrades quickly as temperature rises.
Another possible view: edit page size to generalize across variations on the same prompt

Sample code to log generated text

Adding a fixed id string to each generated sample—here, a unique identifier for each possible combination of temperature and prompt—will let us compare precisely across models.
prompts = load_text(args.prompts)
temperatures = [0.3, 0.6, 0.9]
responses = [[generate(decoder, prime_str=p, temperature=t) for p in prompts] \
             for t in temperatures]

sample_text_table = wandb.Table(columns=["id", "prompt", "temperature", "response"])
for i, t in enumerate(temperatures):
    temp_responses = responses[i]
    for j, (p, r) in enumerate(zip(prompts, temp_responses)):
        # store a unique id for this temperature x prompt combination
        _id = str(i) + "_" + str(j)
        sample_text_table.add_data(_id, p, t, r)

run.log({"samples" : sample_text_table})

Explore to compare model performance

You can compare across logged Tables to sample and explore generated text from different model variants. I've run some quick experiments to generate variants of my baseline model, varying the hidden size and the number of layers.

Align generated text across model variants

To compare two or more matching Tables, log them to the same name or key ("hidden_layer_samples" below). To compare specific sampling conditions, you can log an id for each row: here, a unique identifier for every combination of prompt and temperature, fixed across models. You can then join multiple Tables on id to compare models very precisely. This joining mode should be the default; if it isn't, you can select it via the gear icon in the top right of a Table panel.
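If you prefer to set the join up in code rather than the UI, wandb also exposes a JoinedTable data type that joins two logged Tables on a shared key. A hedged sketch, assuming (hypothetically) that each model variant logged its samples Table to its own Artifact with the "id" column described above:
# sketch: join two sample Tables on their shared "id" column so each row pairs
# the two models' responses for the same prompt x temperature combination;
# the artifact names here are hypothetical
import wandb

run = wandb.init(project="nlg", job_type="compare")
baseline_at = run.use_artifact("samples_hs_200:latest")
variant_at = run.use_artifact("samples_hs_400:latest")

joined = wandb.JoinedTable(
    baseline_at.get("samples"),
    variant_at.get("samples"),
    "id",  # join key fixed across models
)
join_at = wandb.Artifact("model_comparison", type="analysis")
join_at.add(joined, "hidden_layer_samples")
run.log_artifact(join_at)
run.finish()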
In this example, the model indexed 0 (logging the top line in each row) has a hidden size of 400, and the model indexed 1 (bottom line) has the baseline hidden size of 200. Across prompts and temperatures, the narrower model seems to be much more repetitive, duplicating adjacent words and phrases like "the way" for different prompts. The hs_400 model yields more diverse output, though both are fairly low in long-term coherence.
Live example →
A similar view is configurable from the Artifacts context. Here the model with the blue highlight (v15 latest, baseline) corresponds to the first/top line in each row, indexed 0, and the yellow highlight (v14 4_layers) to the second/bottom line in each row, indexed 1.
"O Which the sentence with the prince" is implausible, while "O Sirrain, the duke and the maid of his heart,..." is plausible if Sirrain is a term of address.

Align settings across model variants

To see two tables side-by-side, change the expression in the top left—append ".table.rows"—and convert the type to "Row.Table" in the top right corner. Now I can group by prompt to scroll through all the text generated for that prompt as temperature increases. This lets me match settings across two model versions, compare specific samples, and get a sense of the qualitative difference between the two models (and which directions to explore/which hyperparameters to vary next).
Here I compare the model evaluations from a baseline 2-layer CharRNN model (left vertical panel below) with a 4-layer version of the same model (right panel). As I scroll through the sample generated text for the same prefix and temperature from both models, I notice that the 4-layer version continues the prompt more smoothly and realistically. The baseline often starts a new phrase entirely, while the 4-layer uses an appropriate part of speech and even capitalization to continue the line.
A live example in the Artifacts context →
"our life, exempt from more was a store the way to him with the way" is slightly less poetic than "our life, exempt from the store the deep in the book of the courtesion of the sea"

Next steps

This is a brief preview of what you can do with text in Tables for dataset and prediction visualization. We are actively working on this feature to make it even more powerful and easier to use. In future reports, I plan to extend this analysis to more serious language generation models, datasets, and applications.
If you have any feedback or questions, please post them in the comments section below. Head over to the Tables docs for additional information or check out this quick MNIST colab if you'd like a simple walkthrough on a comfortable dataset. And thanks for reading!
