How to Optimize GPU Cost with Weights and Biases
Scaling organizationally with W&B
To manage experiments and ensure the highest-quality model makes it into production, we use Weights & Biases to support every member of our AV organization. Our Data Engineer, Data Scientists, ML Engineer, and ML Manager all benefit from this instrumentation with minimal change to the tools they were already using. As we scale, the scope of experimentation may grow to involve other tasks and, inevitably, other teams with their own computational requirements.
Hardware Utilization by Persona and Program
Our organization uses a variety of hardware resources that span a wide range of computational power and cost. It therefore makes sense to ensure we use each of these resources efficiently and can switch between them with ease. Thankfully, W&B automatically collects GPU utilization metrics from whatever hardware a run uses, which also lets us better plan the GPU resources we will want for our project.
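As a minimal sketch of how those automatically collected system metrics can be pulled back out for capacity planning, the snippet below uses the W&B public API. The entity/project/run path is a placeholder, and the exact system-metric column names (e.g. system.gpu.0.gpu) can vary with the setup, so the filtering is deliberately defensive:

```python
import wandb

api = wandb.Api()
run = api.run("my-entity/av-perception/abc123")  # placeholder run path

# Hardware metrics live in a separate "system" stream from the logged metrics.
system_df = run.history(stream="system", pandas=True)

# GPU utilization columns typically look like "system.gpu.0.gpu" (percent);
# filter by suffix since the exact key format can differ between setups.
gpu_cols = [c for c in system_df.columns if "gpu" in c and c.endswith(".gpu")]
print(system_df[gpu_cols].mean())  # average utilization per GPU over the run
```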

In fact, you can take a look at our training run's utilization below👇🏼. We tend to hover around ~70-80% utilization.
[GPU utilization panel for the run set (56 runs)]
Analyzing Experimentation Cost with Weights and Biases
- 4 x T4 GPUs + n1-standard-16 (16 vCPUs & 60 GB RAM) 👉🏽 $1,306.12 monthly estimate 👉🏽 $1.789 hourly
- Our cost for our production training runs in USD would be:
  - 1 x T4 GPU + n1-standard-8 (8 vCPUs & 30 GB RAM) 👉🏽 $446.10 monthly estimate 👉🏽 $0.611 hourly
- Our cost for our production evaluation runs in USD would be:
  - No GPU + n1-standard-4 (4 vCPUs & 15 GB RAM) 👉🏽 $119.57 monthly estimate 👉🏽 $0.164 hourly
- We could sum up all the runtimes from every run involved in our project (see the sketch below this list), but that alone does not realistically account for the time spent developing the code.
- ~1 week was spent developing this, so our development cost in USD would be roughly one week at the hourly rates above.
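If we did want that rough sum, a minimal sketch with the W&B public API could look like the following. It assumes a placeholder entity/project path, a single blended hourly rate, and that each finished run exposes its wall-clock duration under the _runtime summary key:

```python
import wandb

HOURLY_RATE_USD = 1.789  # 4 x T4 + n1-standard-16 rate from the estimate above

api = wandb.Api()
runs = api.runs("my-entity/av-perception")  # placeholder entity/project

# Sum the wall-clock duration (in seconds) of every run in the project.
total_seconds = 0.0
num_runs = 0
for run in runs:
    total_seconds += run.summary.get("_runtime", 0) or 0
    num_runs += 1

total_hours = total_seconds / 3600
print(f"{num_runs} runs, {total_hours:.1f} hours "
      f"-> ~${total_hours * HOURLY_RATE_USD:.2f} USD of compute")
```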
Artifact Flow by Persona and Program
During and after each programmatic step, we expect our program to generate information that is useful:
- as context for how that program was executed at that time, and/or
- for versioning the relevant inputs and outputs of every part of our ML workflow.
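For the versioning half of that list, here is a hedged sketch of what one step could look like with W&B Artifacts; the project, artifact names, and output file are placeholders rather than the exact ones used in this project:

```python
import wandb

# Hypothetical evaluation step: consume the versioned model produced upstream
# and version its own outputs, so the dataset -> model -> results lineage
# stays traceable in the Artifacts graph.
run = wandb.init(project="av-perception", job_type="evaluation")  # placeholder project

model_art = run.use_artifact("detector-model:latest")  # placeholder artifact name
model_dir = model_art.download()

# ... load the weights from model_dir and run evaluation here ...

results = wandb.Artifact("evaluation-results", type="results")
results.add_file("metrics.json")  # placeholder file produced by the evaluation
run.log_artifact(results)
run.finish()
```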

Not only that, we get a precise trace of every related step and artifact, which makes the path to orchestrating our ML projects straightforward. The best part? The code will change minimally going forward. Even better? It takes ONE line of code in our Ray training runs to capture anywhere from one to an unlimited number of training experiments.
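As an illustration of what that one line can look like, here is a minimal Ray Tune-style sketch, assuming Ray 2.x AIR APIs (import paths have moved between Ray versions) and a placeholder project name; the source does not show its exact training code, so this is not the author's implementation:

```python
from ray import tune
from ray.air import RunConfig, session
from ray.air.integrations.wandb import WandbLoggerCallback

def train_fn(config):
    # Stand-in training loop; anything reported here is mirrored to W&B.
    session.report({"loss": config["lr"] * 0.1})

tuner = tune.Tuner(
    train_fn,
    param_space={"lr": tune.grid_search([1e-3, 1e-4, 1e-5])},
    run_config=RunConfig(
        # The single line: every trial launched here becomes a W&B run.
        callbacks=[WandbLoggerCallback(project="av-perception")],  # placeholder project
    ),
)
tuner.fit()
```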