Hyperparameter Search with spaCy and Weights & Biases

Find the optimal hyperparameters for your spaCy project using W&B Sweeps. Made by Scott Condron using Weights & Biases

Introduction 🧹

You probably don't need a pitch about why you should automate hyperparameter search. It's almost a rite of passage for machine learning practitioners to manually run a grid search over parameters and see how painful and error-prone it is.
Rather than trying every configuration manually, you can let W&B Sweeps run in the background while you focus on the more interesting work you have to do. In this tutorial, we'll show you how to use Sweeps to find the optimal hyperparameters for your spaCy project.
After running a sweep, you'll automatically get a bunch of useful plots that will tell you how important your hyperparameters are to the metric you care about.

What is spaCy and What is Weights & Biases?

spaCy is a tremendous resource for your Natural Language Processing (NLP) needs. That includes Named Entity Recognition (NER), Part of Speech tagging, text classification, and a whole bunch more. And while spaCy works well out-of-the-box, its components are deeply customizable and composable.
Weights & Biases makes running collaborative machine learning projects a breeze. You can focus on what you're trying to experiment with, and W&B will take on the burden of logging and keeping track of everything. If you want to review a loss plot, download the latest model for production, or just see which configurations produced a certain model, W&B is your friend. There are also features to help you and your team collaborate, such as shared dashboards and interactive reports.

New to W&B + spaCy?

If you're new to using W&B with spaCy projects and would like to hear about all the other features W&B has to offer like experiment tracking, model checkpointing and dataset versioning, you can read about that here:
Report Gallery

TL;DR

To run a sweep, you have two options:

Using the Command Line Interface

  1. Run wandb sweep my_sweep.yml to initialize the Sweep.
  2. Run wandb agent sweep_id, using the sweep ID printed by wandb sweep, to launch the agents which try the different hyperparameters. You can see an example config here.

Using the Python API

  1. Call wandb.sweep and pass in your Sweep config along with your project name and entity (username or team name). It returns a sweep_id.
  2. Call wandb.agent with that sweep_id and the function to train your model. You can see some example code here.
sweep_id = wandb.sweep(sweep_config, project="wandb_spacy_sweeps", entity='wandb')
wandb.agent(sweep_id, train_spacy, count=20)
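For reference, here is a minimal sketch of what sweep_config could look like when defined in Python. The parameter names mirror the YAML config used later in this tutorial; the build_sweep_config helper is hypothetical, and train_spacy stands for your own training function. The wandb calls are left as comments so the snippet runs without an account.

```python
def build_sweep_config() -> dict:
    """Return a sweep definition equivalent to a small YAML sweep file."""
    return {
        "method": "bayes",
        "metric": {"name": "cats_macro_auc", "goal": "maximize"},
        "parameters": {
            # Discrete options use "values".
            "components.textcat.model.conv_depth": {"values": [2, 3, 4]},
            # Continuous ranges use a distribution with min/max.
            "training.dropout": {
                "distribution": "uniform",
                "min": 0.05,
                "max": 0.5,
            },
        },
    }


sweep_config = build_sweep_config()

# With wandb installed and logged in, the sweep would then be started with:
# import wandb
# sweep_id = wandb.sweep(sweep_config, project="wandb_spacy_sweeps", entity="wandb")
# wandb.agent(sweep_id, train_spacy, count=20)  # count caps the number of runs
```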

Try it out yourself using spaCy projects

Clone the project with spaCy projects:
python -m spacy project clone integrations/wandb
Install the dependencies:
python -m spacy project run install
Download the assets:
python -m spacy project assets
Run the hyperparameter search:
python -m spacy project run parameter-search

Set up your own spaCy project to use W&B Sweeps

Add the Weights & Biases integration to your spaCy config

First, add Weights & Biases to your project to track your experiments (and optionally your datasets/model versions). All you need to do is add a few lines to your project's config .cfg file. For more information, visit the spaCy integration page in our docs or read this blog post.
Add the following to your spaCy configuration file:
[training.logger]
@loggers = "spacy.WandbLogger.v2"
project_name = "your_project_name"
remove_config_values = []
log_dataset_dir = "./assets"
model_log_interval = 1000
Note: log_dataset_dir is only necessary if you want dataset versioning and model_log_interval is only necessary if you want model checkpointing.
As with spaCy training, you can use either code or config files to manage sweeps. We'll show the config version below but you can see a version using code here.
We'll create a configuration YAML file ./scripts/sweep.yml and choose our search strategy along with the values for the hyperparameters we want to test.

Choose your Search Strategy

W&B Sweeps chooses the next hyperparameters to try based on the search strategy you define. There are a few options (grid, random, and bayes), each with its own tradeoffs.
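To get a feel for the tradeoff between exhaustive and sampled search, the sketch below enumerates a full grid over two discrete hyperparameters and draws a few random samples from the same space. It is plain Python for illustration, not W&B's implementation, and the function names are made up.

```python
import itertools
import random

search_space = {
    "conv_depth": [2, 3, 4],
    "ngram_size": [1, 2, 3],
}


def grid_configs(space: dict) -> list:
    """Every combination of values: exhaustive, but grows multiplicatively."""
    keys = list(space)
    return [dict(zip(keys, combo)) for combo in itertools.product(*space.values())]


def random_configs(space: dict, n: int, seed: int = 0) -> list:
    """n independent draws: cheaper, but may repeat or miss combinations."""
    rng = random.Random(seed)
    return [{k: rng.choice(v) for k, v in space.items()} for _ in range(n)]


all_combinations = grid_configs(search_space)
print(len(all_combinations))  # 3 * 3 = 9 configurations
sampled = random_configs(search_space, n=4)
```

A Bayesian strategy differs from both: it uses the metric from earlier runs to decide where to sample next, which is why it needs a metric to be configured.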

Configure the Parameter Search

Now we need to define which hyperparameters to search and what options to give them.
To refer to spaCy config options, we use the same dot notation that spaCy configs use. Each parameter's values key lists the options to try for that parameter.
parameters:
  components.textcat.model.conv_depth:
    values:
      - 2
      - 3
      - 4
  components.textcat.model.ngram_size:
    values:
      - 1
      - 2
      - 3
You can see the full config file here. If you would prefer to define your sweep in a Python script, see here.
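To make the dot notation concrete, here is a small sketch of how dotted keys map onto the nested sections of a spaCy config. spaCy ships its own helper for this, util.dot_to_dict, which the training script later in this tutorial uses; the dotted_to_nested function below is a hypothetical stand-in written only to show the idea.

```python
def dotted_to_nested(flat: dict) -> dict:
    """Expand {"a.b.c": 1} into {"a": {"b": {"c": 1}}}."""
    nested: dict = {}
    for dotted_key, value in flat.items():
        *parents, leaf = dotted_key.split(".")
        node = nested
        for part in parents:
            # Walk down, creating intermediate sections as needed.
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested


# One set of values as a sweep run might receive them.
flat = {
    "components.textcat.model.conv_depth": 3,
    "components.textcat.model.ngram_size": 2,
}
nested = dotted_to_nested(flat)
```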

Combine the Configurations from spaCy and W&B Sweeps

Thankfully for everyone, the code to combine both configs is tiny:
import typer
from pathlib import Path
from spacy.training.loop import train
from spacy.training.initialize import init_nlp
from spacy import util
from thinc.api import Config
import wandb


def main(default_config: Path, output_path: Path):
    # Load the full local spaCy training config.
    loaded_local_config = util.load_config(default_config)
    with wandb.init() as run:
        # Convert the dotted sweep parameters into a nested config.
        sweeps_config = Config(util.dot_to_dict(run.config))
        # Sweep values override the local defaults.
        merged_config = Config(loaded_local_config).merge(sweeps_config)
        nlp = init_nlp(merged_config)
        train(nlp, output_path, use_gpu=True)


if __name__ == "__main__":
    typer.run(main)
All we are doing here is loading a default local spaCy config (the configuration that has everything for your training script), merging it with the Sweep config (the hyperparameters we're changing) and then starting training.
You can see this in our spaCy integration code here.
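The merge step is the heart of this script: the local config supplies defaults and the sweep supplies overrides. The sketch below imitates that behaviour with plain dictionaries. It is not thinc's implementation of Config.merge; deep_merge is a hypothetical helper, and the numbers are made up for illustration.

```python
import copy


def deep_merge(defaults: dict, overrides: dict) -> dict:
    """Return a copy of defaults with overrides applied recursively."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are sections: merge them key by key.
            merged[key] = deep_merge(merged[key], value)
        else:
            # Otherwise the override wins outright.
            merged[key] = copy.deepcopy(value)
    return merged


default_config = {"training": {"dropout": 0.1, "max_steps": 2000}}
sweep_values = {"training": {"dropout": 0.3}}
merged = deep_merge(default_config, sweep_values)
```

Only the swept value changes; every setting the sweep does not mention keeps its value from the default config.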

Call the training code with Sweeps 🧹

Finally, we need to tell Sweeps what code to call with these hyperparameters. To do this, we define a custom command in the config YAML file like:
command:
  - ${env}
  - ${interpreter}
  - scripts/sweeps_using_config.py
  - ./configs/default_config.cfg
  - ./training
where ./scripts/sweeps_using_config.py is the script defined above that merges the configs, ./configs/default_config.cfg is the config that defines all of the other parameters you're not searching through, and ./training is the output path.
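As a rough illustration of what the agent ends up running, the sketch below expands the two macros in the command list. The mapping is an assumption based on the W&B documentation for Unix-like systems (${env} becomes /usr/bin/env and ${interpreter} becomes python); expand_command is a hypothetical helper, not part of wandb.

```python
# Assumed macro expansions on a Unix-like system.
MACROS = {
    "${env}": ["/usr/bin/env"],
    "${interpreter}": ["python"],
}


def expand_command(command: list) -> list:
    """Replace known macros and pass every other token through unchanged."""
    expanded = []
    for token in command:
        expanded.extend(MACROS.get(token, [token]))
    return expanded


command = [
    "${env}",
    "${interpreter}",
    "scripts/sweeps_using_config.py",
    "./configs/default_config.cfg",
    "./training",
]
argv = expand_command(command)
print(" ".join(argv))
```

The two trailing paths become the default_config and output_path arguments of main. The hyperparameters themselves are not passed on the command line in this setup; the script reads them from run.config after calling wandb.init().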

Full Sweeps Config

Here's the entire Sweeps config:
./scripts/sweep.yml:
method: bayes
metric:
  goal: maximize
  name: cats_macro_auc
command:
  - ${env}
  - ${interpreter}
  - scripts/sweeps_using_config.py
  - ./configs/default_config.cfg
  - ./training
parameters:
  components.textcat.model.conv_depth:
    values:
      - 2
      - 3
      - 4
  components.textcat.model.ngram_size:
    values:
      - 1
      - 2
      - 3
  training.dropout:
    distribution: uniform
    max: 0.5
    min: 0.05
  training.optimizer.learn_rate:
    distribution: uniform
    max: 0.01
    min: 0.001
Note: We've chosen the bayes search strategy above, which requires a metric to be defined so that it can choose the next hyperparameters to try.
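Because a missing metric is an easy mistake to make, it can help to sanity-check a sweep definition before launching it. The sketch below does this for a config held as a Python dictionary. check_sweep_config is a hypothetical helper that covers only a few basic rules; it is not a replacement for the validation W&B performs itself.

```python
KNOWN_METHODS = {"grid", "random", "bayes"}


def check_sweep_config(config: dict) -> list:
    """Return a list of human-readable problems; empty means none were found."""
    problems = []
    method = config.get("method")
    if method not in KNOWN_METHODS:
        problems.append(f"unknown method: {method!r}")
    metric = config.get("metric") or {}
    if method == "bayes" and not metric.get("name"):
        problems.append("bayes search needs metric.name")
    if not config.get("parameters"):
        problems.append("no parameters to search over")
    return problems


good = {
    "method": "bayes",
    "metric": {"name": "cats_macro_auc", "goal": "maximize"},
    "parameters": {"training.dropout": {"distribution": "uniform", "min": 0.05, "max": 0.5}},
}
bad = {
    "method": "bayes",
    "parameters": {"training.dropout": {"distribution": "uniform", "min": 0.05, "max": 0.5}},
}
good_problems = check_sweep_config(good)
bad_problems = check_sweep_config(bad)
```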
Now that we have defined our sweeps configuration file, we can call it using the wandb CLI like this:
wandb sweep ./scripts/sweep.yml
which will output a sweep-id which you should pass to wandb agent:
wandb agent sweep-id
You can call this on each machine or within each process that you'd like to contribute to the sweep.
And that's it. Now you can get back to the other, harder-to-automate parts of your job.

Conclusion

In this tutorial, you've seen how to run a hyperparameter search for your spaCy project to find the best hyperparameters for your training configuration.
By defining which hyperparameters to search over, choosing a search strategy, and adding a little 👌 code to play nicely with spaCy, you have a great way to tune your project's hyperparameters with W&B Sweeps. We work hard to make sure W&B works well with different frameworks so practitioners can spend more time on the interesting parts of their work. Thanks for reading!
