Our latest feature for Dataset and Prediction Visualization now extends to natural language processing tasks, so you can dynamically interact with the training data, predictions, and generated output of language models. In this guide, I generate Shakespearean prose with a character-based RNN in PyTorch to illustrate common use cases in exploratory natural language generation.

View incremental training logs

Log your model's predictions over time to get an overall sense of how performance improves during training. Here I log the model's continuation of the input "Wh" every 100 epochs, alongside the loss. Over time, a higher proportion of the generated letter sequences are actual dictionary words, character names in caps become more recognizably Shakespearean, and some of the phrases start to make a little sense.
Interact with a live example →
Early sample: "Of be the print is de-comer-sumerence speating"
Late sample: "What advancements? WARWICK: I know you not lies it is kneel. Come, time the base"

Sample code to create a text table

Key additions to the original char-rnn.pytorch training code are highlighted in bold. You can also see the code that created any particular Artifact: click the "Overview" tab for that Artifact, then the name of the run listed next to "Output by", then the code tab for that run (bottom-left icon, "{ }").
# create a wandb run and store experiment settings
# and hyperparameters in the config object
run = wandb.init(project="nlg", job_type="train")
wandb.config.update(...)

# [initialize models]

# create an Artifact to version the training predictions logged
at = wandb.Artifact("train_charnn", type="train_samples")
# create a Table with columns for the epoch index,
# the model loss, and the generated text
text_table = wandb.Table(columns=["epoch", "loss", "text"])

print("Training for %d epochs..." % args.n_epochs)
for epoch in tqdm(range(1, args.n_epochs + 1)):
    loss = train(*random_training_set(args.chunk_len, args.batch_size))
    loss_avg += loss
    # log loss to wandb
    wandb.log({"loss": loss})

    if epoch % args.print_every == 0:
        print('[%s (%d %d%%) %.4f]' % (time_since(start), epoch, epoch / args.n_epochs * 100, loss))
        train_text = generate(decoder, 'Wh', 100, cuda=args.cuda)
        # log data row to Table
        text_table.add_data(epoch, loss, train_text)
        print(train_text, '\n')

# log the table to wandb
at.add(text_table, "training_samples")
run.log_artifact(at)

Save and reload models seamlessly

Save trained models as artifacts

Store models in wandb during training by saving the architecture & weights to a local file and logging this file as an artifact. The file format and syntax to save a model vary by deep learning framework, but wandb is framework agnostic: you can add any directory or file format to an artifact.
def save(run, save_model_filename):
    # save model locally using torch
    torch.save(decoder, save_model_filename)
    # create an artifact to track this model
    model_at = wandb.Artifact(save_model_filename, type="charnn_model")
    # add the model file
    model_at.add_file(save_model_filename)
    # log artifact to wandb
    run.log_artifact(model_at)
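Because artifacts accept arbitrary files, the same pattern extends to a whole directory of checkpoints. Here is a minimal sketch, assuming a local checkpoints/ folder (the folder and artifact names are illustrative, not part of the original script):

import wandb

run = wandb.init(project="nlg", job_type="train")
# version every file in a checkpoint directory as a single artifact
ckpt_at = wandb.Artifact("charnn_checkpoints", type="charnn_model")
ckpt_at.add_dir("checkpoints/")
run.log_artifact(ckpt_at)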

Load trained models for inference

To reload the model, use the name of a previously saved model and specify the version, either numerically or via an alias. For example, python evaluate.py -m base_shakespeare.pt:v3 loads version 3 of the base_shakespeare.pt model, while python evaluate.py -m base_shakespeare.pt:latest loads the most recent version logged to the base_shakespeare.pt model artifact.
run = wandb.init(project="nlg", job_type="evaluate")
model_at = run.use_artifact(args.model_filename, type='charnn_model')
# acquire a local copy of the model
model_dir = model_at.download()
# extract model filename
model_filename = os.listdir(model_dir)[0]
# load model using torch
decoder = torch.load(os.path.join(model_dir, model_filename))
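If you'd rather hard-code the version than pass it on the command line, use_artifact accepts the alias directly. A minimal sketch, assuming the artifact names from the example above:

# pin a specific version of the model artifact...
model_at = run.use_artifact("base_shakespeare.pt:v3", type="charnn_model")
# ...or always fetch the most recently logged version
model_at = run.use_artifact("base_shakespeare.pt:latest", type="charnn_model")
model_dir = model_at.download()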

Organize generated output

Browse text samples

Interactive example →
Load a previously trained model, sample from it (e.g. with a list of prompts), and organize the generated text in a wandb.Table. In this example, I vary the temperature to adjust the randomness of the output.
This model's continuity across line breaks is fairly weak
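For context on the temperature setting: it rescales the model's predicted distribution before sampling, so low values favor the most likely next character and high values flatten the distribution. The char-rnn generate() used throughout this report handles this internally; the core idea looks roughly like this sketch (the function and variable names below are illustrative, not taken from the repo):

import torch

def sample_with_temperature(logits, temperature=0.8):
    # divide the raw logits by the temperature, then exponentiate:
    # temperature < 1 sharpens the distribution, > 1 flattens it
    weights = (logits / temperature).exp()
    # torch.multinomial accepts unnormalized weights
    return torch.multinomial(weights, 1).item()

# e.g. pick the next character index from a vocabulary of 100 characters
next_idx = sample_with_temperature(torch.randn(100), temperature=0.5)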

Dynamically group by relevant fields

Group by different columns to get a sense of your data at a glance. Page through responses to different prompts at the same temperature to judge how much randomness to add. For this particular model, 0.2 is too boring, while anything above 0.5 yields made-up words.
Example with a broader temperature range →
Choose the pagination scheme: 1, 2, 3, 5, or 10 sections per cell
Group by prompt to scroll through the corresponding responses as temperature increases. The phrase coherence degrades quickly as temperature rises.

Sample code to log generated text

prompts = load_text(args.prompts)
temperatures = [0.3, 0.6, 0.9]
responses = [[generate(decoder, prime_str=p, temperature=t) for p in prompts]
             for t in temperatures]

samples_at = wandb.Artifact(name="charnn_samples", type="prompt_response")
text_table = wandb.Table(columns=["prompt", "temperature", "response"])
for i, t in enumerate(temperatures):
    temp_responses = responses[i]
    for p, r in zip(prompts, temp_responses):
        text_table.add_data(p, t, r)

samples_at.add(text_table, "samples")
run.log_artifact(samples_at)

Explore to compare model performance

You can compare across logged tables to sample and explore generated text from different model variants. I've run some quick experiments to train several variants of my baseline model.
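Each variant is just a separate run with its hyperparameters recorded in the config, so the resulting tables stay easy to tell apart in the UI. A minimal sketch of how such variants might be launched (the config keys and loop structure are illustrative, not the original training script):

import wandb

# one run per model variant; the config values distinguish them later
for n_layers, hidden_size in [(2, 200), (4, 200), (2, 400)]:
    run = wandb.init(project="nlg", job_type="train",
                     config={"n_layers": n_layers, "hidden_size": hidden_size})
    # [build and train the variant, then log its samples table as above]
    run.finish()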

Align generated text across model variants

Live example →
Here I compare the model evaluations from a baseline 2-layer CharRNN model (blue highlight, alias 0 in the table, top row of each cell) to a 4-layer version of the same model (yellow highlight, alias 1 in the table, bottom row). As I scroll through the sample text generated for the same prefix and temperature by both models, I notice that the 4-layer version continues the prompt more smoothly and realistically. The baseline often starts a new phrase entirely, while the 4-layer version continues the line with an appropriate part of speech and even appropriate capitalization.
"O Which the sentence with the prince" is implausible, while "O Sirrain, the duke and the maid of his heart,..." is plausible if Sirrain is a term of address.

Align settings across model variants

Live example →
Switch the view to "SplitPanel" to see two tables side by side. Now I can group by prompt to scroll through all the text generated for that prompt as temperature increases. This lets me match settings across the two model versions, compare specific samples, and get a sense of the qualitative difference between the two models (and which directions to explore or hyperparameters to vary next). In this example, the model on the left (blue highlight) has a hidden size of 200, and the model on the right (yellow highlight) has a hidden size of 400. Across prompts and temperatures, the narrower model seems much more repetitive, duplicating adjacent words and phrases like "the way" across different prompts. The hs_400 model yields more diverse output, though both are fairly weak in long-term coherence.
"our life, exempt from more was a store the way to him with the way" is slightly less poetic than "our life, exempt from the store the deep in the book of the courtesion of the sea"

Next steps

This is a brief preview of what you can do with text in Dataset and Prediction Visualization. We are actively working on this feature to make it even more powerful and easier to use. In future reports, I plan to extend this analysis to more serious language generation models, datasets, and applications. If you have any feedback or questions, please post them in the comments section below.