Best Practices for Logging Data for Foundation Model Builders
Follow these tips and tricks to avoid being slowed down as you scale your model training with Weights & Biases
Created on February 11|Last edited on September 3

The W&B SDK is extremely flexible and performant, but with that flexibility comes responsibility: what is the best way to use W&B with a large number of metrics so that logging scales with your training? Read on to learn all the tips and tricks.
This guidance applies to SDK version 0.21.3 and higher. If you are on an older version, please upgrade! The SDK becomes more performant and stable with each release.
💡
Configuration ⚙️
The SDK offers various configuration options that let users scale their experiments and adapt to their environments. Most settings can be provided either as an environment variable or through wandb.Settings in the wandb.init call:
- WANDB_X_FILE_STREAM_MAX_LINE_BYTES env variable: By default, the W&B SDK caps the data it transfers in a single run.log call at 10 MB. To go over this limit, set this env var to the limit you'd like (in bytes). This comes with a trade-off: the more data you send in a single call, the slower that call is processed at the upper scales of model training. If possible, we recommend splitting data across multiple run.log calls before resorting to this method.
- WANDB_INIT_TIMEOUT env variable: When using the SDK serially, with context-manager runs, or through serial multiprocessing, you may find that init takes longer because the underlying process ensures the previous run's data is sent in order. Raising the init timeout gives your script more time for this sync-up to complete.
- WANDB_X_GRAPHQL_TIMEOUT_SECONDS setting: Because of the large volume of data sent by the SDK, some background requests can time out due to server-side backend processing times. It is good practice to raise this GraphQL timeout so the backend has more time to respond to requests.
- WANDB_DEBUG env variable: This one doesn't need to be set all the time, but enabling it turns on verbose logging, which lets W&B rapidly diagnose whether a bug exists in the SDK. It is needed less frequently, as the SDK is quite battle-tested.
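Putting the settings above together, here is a minimal sketch. The values are illustrative, and the wandb.Settings field name in the comment is an assumption based on recent SDK versions; check your version's settings reference.

```python
import os

# Illustrative values -- tune for your workload. Setting these before
# wandb.init() runs is enough; the SDK reads them at startup.
os.environ["WANDB_X_FILE_STREAM_MAX_LINE_BYTES"] = str(32 * 1024 * 1024)  # raise the 10 MB cap to 32 MB
os.environ["WANDB_INIT_TIMEOUT"] = "600"                # give init up to 10 minutes
os.environ["WANDB_X_GRAPHQL_TIMEOUT_SECONDS"] = "120"   # allow slower backend responses
os.environ["WANDB_DEBUG"] = "true"                      # verbose logs, only when debugging

# Some of these can equivalently be passed at init time, e.g.:
# import wandb
# run = wandb.init(settings=wandb.Settings(init_timeout=600.0))
```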
Logging Patterns 🤖
How you log matters: structuring your logging in a network-friendly pattern ensures the fastest possible performance from W&B. Here are some considerations for setting up your logging scripts:
- Use define_metric in a limited scope: Big model builders often want to log against a custom step rather than the default W&B step. This is all fine, but commonly the wish is to also represent all graphs with that step, so eventually a line like this shows up: run.define_metric("*", step_metric="global_step"). At higher scales (around 100k metrics per log line), this is detrimental to the front end for two reasons:
- The underlying logic makes loading the run table much slower
- This specific line makes the run metadata so large that it hits its 15 MB limit.
As of wandb==0.19.10 we are previewing removing this limitation through specific configuration. Contact W&B for more information.
💡
In general, if there are metrics in your workflow you know you won't be viewing later, you can set the hidden=True argument within run.define_metric() to automatically move the panel(s) to the hidden section of the UI.
```python
import wandb

run = wandb.init()
run.define_metric("custom_step", hidden=True)
run.define_metric("loss", step_metric="custom_step")
for i in range(10):
    run.log({"custom_step": i**3, "loss": i})
run.finish()
```
- Batch metrics up to a certain point: Researchers sometimes wonder whether it is more efficient to send a single step at a time or multiple steps over a period. The answer is: it depends. At smaller scales batching is more efficient, but it follows a bell curve, and at higher scales it becomes detrimental to processing time. We recommend logging per step as a default, unless you know your training will stay under 100k metrics per step, and keeping each per-step payload under 10 MB.
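One way to respect the 10 MB per-call ceiling while still batching is to estimate the payload size before each flush. A minimal sketch: MetricBatcher is a hypothetical helper, not part of the W&B SDK, and the JSON-size estimate is only an approximation of what the SDK actually transmits.

```python
import json

MAX_LINE_BYTES = 10 * 1024 * 1024  # default per-call limit (10 MB)

class MetricBatcher:
    """Accumulates metrics and flushes before the payload exceeds the limit.

    `log_fn` stands in for run.log; pass the real method in training code.
    """

    def __init__(self, log_fn, max_bytes=MAX_LINE_BYTES):
        self.log_fn = log_fn
        self.max_bytes = max_bytes
        self.pending = {}

    def add(self, metrics):
        candidate = {**self.pending, **metrics}
        # Rough payload-size check; flush the old batch if we would overflow.
        if len(json.dumps(candidate).encode()) > self.max_bytes and self.pending:
            self.flush()
            self.pending = dict(metrics)
        else:
            self.pending = candidate

    def flush(self):
        if self.pending:
            self.log_fn(self.pending)
            self.pending = {}
```

In training code you would construct it as MetricBatcher(run.log) and call flush() at checkpoints and before run.finish().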
- Sparse metrics logging: Often a subset of metrics is more important than the rest. We highly recommend the sparse metrics logging pattern: it lets the front end render metrics only at the valuable steps, rather than at every step. For example, we commonly see lower-value metrics logged every 100 steps.
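The sparse pattern can be as simple as merging the expensive metrics into the payload only on certain steps. A sketch, where build_payload is a hypothetical helper and the metric names are illustrative:

```python
def build_payload(step, cheap, expensive, every=100):
    """Log cheap metrics every step; fold in expensive ones every `every` steps."""
    payload = dict(cheap)
    if step % every == 0:
        payload.update(expensive)
    return payload

# In a training loop this would feed run.log(build_payload(step, ...)).
```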
- Only log runs in the same project if they need to be compared to each other: Logging runs to separate projects reduces the number of data points the front end needs to load, which results in faster front-end performance.
- Always log through the object returned by run = wandb.init(). This keeps logging consistent and will allow multi-threaded logging in the future.
As of wandb==0.19.10 the SDK supports multiple active runs on the same process.
💡
Environment Considerations 💻
The SDK utilizes a highly performant sidecar service to log metrics at blazing-fast speed. This service was heavily improved in SDK versions 0.18+ (so please upgrade!). Logging this way, without interrupting the user process, is the best way to ensure your data gets delivered safely, but it comes with some considerations:
- Ensure your environment can accommodate this extra service. It appears as a system process called wandb-core and is responsible for network transfers to W&B.
- Sometimes customers want to prevent the service from pinging health information back to W&B, such as in air-gapped environments. To accomplish that, set the environment variable WANDB_ERROR_REPORTING to false.
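In a shell, for example (set before launching your training script; the launcher command is whatever you already use):

```shell
# Prevent the wandb-core service from reporting health/error telemetry
# back to W&B -- useful on air-gapped training hosts.
export WANDB_ERROR_REPORTING=false
```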
- Ensure you call run.finish() whenever you finish reporting for a run -- it ensures the correct data and metadata are sent to the backend for correct logging.