How to Optimize GPU Cost with Weights and Biases
Scaling organizationally with W&B
To manage experiments and ensure the highest-quality model makes it into production, we use Weights & Biases to support every member of our AV organization. Our Data Engineer, Data Scientists, ML Engineer, and ML Manager all benefit from this instrumentation with minimal change to the tools they were already using. As we scale, the scope of experimentation may grow to involve other tasks and, inevitably, other teams with their own computational requirements.
Hardware Utilization by Persona and Program
Our organization uses a variety of hardware resources that span a wide range of computational power and cost. It therefore makes sense to ensure we use each of these resources efficiently and can switch between them with ease. Thankfully, W&B automatically collects GPU utilization metrics from whatever hardware a run uses, which also lets us better plan the GPU resources we will want for our project.
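As a minimal sketch of how those automatically collected system metrics can be pulled back out for capacity planning, the snippet below uses the W&B public API. The entity/project/run path is a placeholder, and the exact system-metric column names (e.g. system.gpu.0.gpu) can vary with the setup, so the filtering is deliberately defensive:

```python
import wandb

api = wandb.Api()
run = api.run("my-entity/av-perception/abc123")  # placeholder run path

# Hardware metrics live in a separate "system" stream from the logged metrics.
system_df = run.history(stream="system", pandas=True)

# GPU utilization columns typically look like "system.gpu.0.gpu" (percent);
# filter by suffix since the exact key format can differ between setups.
gpu_cols = [c for c in system_df.columns if "gpu" in c and c.endswith(".gpu")]
print(system_df[gpu_cols].mean())  # average utilization per GPU over the run
```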

In fact, you can take a look at our training run's utilization below👇🏼. We tend to hover around ~70-80% utilization.
[GPU utilization panel for the run set (56 runs)]
Analyzing Experimentation Cost with Weights and Biases
- 4 x T4 GPUs + n1-standard-16 (16 vCPUs & 60 GB RAM) 👉🏽 $1,306.12 monthly estimate 👉🏽 $1.789 hourly
- Our cost for our production training runs in USD would be:
  - 1 x T4 GPU + n1-standard-8 (8 vCPUs & 30 GB RAM) 👉🏽 $446.10 monthly estimate 👉🏽 $0.611 hourly
- Our cost for our production evaluation runs in USD would be:
  - No GPU + n1-standard-4 (4 vCPUs & 15 GB RAM) 👉🏽 $119.57 monthly estimate 👉🏽 $0.164 hourly
- We could sum up all the runtimes from every run involved in our project (see the sketch below this list), but that alone does not realistically account for the time spent developing the code.
- ~1 week was spent developing this, so our development cost in USD would be roughly one week at the hourly rates above.
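If we did want that rough sum, a minimal sketch with the W&B public API could look like the following. It assumes a placeholder entity/project path, a single blended hourly rate, and that each finished run exposes its wall-clock duration under the _runtime summary key:

```python
import wandb

HOURLY_RATE_USD = 1.789  # 4 x T4 + n1-standard-16 rate from the estimate above

api = wandb.Api()
runs = api.runs("my-entity/av-perception")  # placeholder entity/project

# Sum the wall-clock duration (in seconds) of every run in the project.
total_seconds = 0.0
num_runs = 0
for run in runs:
    total_seconds += run.summary.get("_runtime", 0) or 0
    num_runs += 1

total_hours = total_seconds / 3600
print(f"{num_runs} runs, {total_hours:.1f} hours "
      f"-> ~${total_hours * HOURLY_RATE_USD:.2f} USD of compute")
```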
Artifact Flow by Persona and Program
During and after each programmatic step, we expect our program to generate information that is useful:
- as context for how that program was executed at that time, and/or
- for versioning the relevant inputs and outputs of every part of our ML workflow.
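For the versioning half of that list, here is a hedged sketch of what one step could look like with W&B Artifacts; the project, artifact names, and output file are placeholders rather than the exact ones used in this project:

```python
import wandb

# Hypothetical evaluation step: consume the versioned model produced upstream
# and version its own outputs, so the dataset -> model -> results lineage
# stays traceable in the Artifacts graph.
run = wandb.init(project="av-perception", job_type="evaluation")  # placeholder project

model_art = run.use_artifact("detector-model:latest")  # placeholder artifact name
model_dir = model_art.download()

# ... load the weights from model_dir and run evaluation here ...

results = wandb.Artifact("evaluation-results", type="results")
results.add_file("metrics.json")  # placeholder file produced by the evaluation
run.log_artifact(results)
run.finish()
```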

Not only that, we get a precise trace of every related step and artifact, which makes the path to orchestrating our ML projects straightforward. The best part? The code will change minimally going forward. Even better? It takes ONE line of code in our Ray training runs to capture anywhere from one to an unlimited number of training experiments.
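As an illustration of what that one line can look like, here is a minimal Ray Tune-style sketch, assuming Ray 2.x AIR APIs (import paths have moved between Ray versions) and a placeholder project name; the source does not show its exact training code, so this is not the author's implementation:

```python
from ray import tune
from ray.air import RunConfig, session
from ray.air.integrations.wandb import WandbLoggerCallback

def train_fn(config):
    # Stand-in training loop; anything reported here is mirrored to W&B.
    session.report({"loss": config["lr"] * 0.1})

tuner = tune.Tuner(
    train_fn,
    param_space={"lr": tune.grid_search([1e-3, 1e-4, 1e-5])},
    run_config=RunConfig(
        # The single line: every trial launched here becomes a W&B run.
        callbacks=[WandbLoggerCallback(project="av-perception")],  # placeholder project
    ),
)
tuner.fit()
```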