This section shows how our continuous batching implementation improves GPU memory utilization compared to static batching.
The key graphs here are the first two: stream utilization and GPU memory usage.
The top-left graph (stream utilization) shows the percentage of prompt generations in the stream that are still actively generating tokens, i.e., not yet finished.
With static batching, the stream starts full; as individual generations complete, stream utilization steadily decreases, and the stream is only refilled once every generation in it has finished.
In the second graph (GPU memory usage %), static batching shows large memory spikes followed by sharp drops, whereas continuous batching maintains consistent memory usage because finished generations are immediately replaced.
[Line charts from the run set: gpu memory usage (%), tokens_per_second, stream utilization (%)]
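To make the difference concrete, here is a minimal sketch of the two scheduling policies, not our actual implementation. The stream size, prompt count, and sampled generation lengths are all hypothetical placeholders; the point is to show why static batching's stream utilization decays until a refill while continuous batching stays near full.

```python
import random

STREAM_SIZE = 8     # hypothetical number of generation slots in the stream
NUM_PROMPTS = 64    # hypothetical total prompts to serve
random.seed(0)

def gen_length():
    # Hypothetical per-prompt generation length, in decode steps.
    return random.randint(16, 128)

def static_batching(num_prompts):
    """Refill the stream only once every generation in it has finished."""
    utilization, served = [], 0
    while served < num_prompts:
        batch = [gen_length() for _ in range(min(STREAM_SIZE, num_prompts - served))]
        served += len(batch)
        while any(t > 0 for t in batch):          # wait for the whole batch
            batch = [t - 1 for t in batch]        # one decode step
            utilization.append(sum(t > 0 for t in batch) / STREAM_SIZE)
    return utilization

def continuous_batching(num_prompts):
    """Swap a fresh prompt into a slot as soon as its generation finishes."""
    batch = [gen_length() for _ in range(min(STREAM_SIZE, num_prompts))]
    utilization, served = [], len(batch)
    while batch:
        batch = [t - 1 for t in batch if t > 1]   # one decode step, drop finished
        while len(batch) < STREAM_SIZE and served < num_prompts:
            batch.append(gen_length())            # refill the freed slot
            served += 1
        utilization.append(len(batch) / STREAM_SIZE)
    return utilization

static = static_batching(NUM_PROMPTS)
cont = continuous_batching(NUM_PROMPTS)
print(f"mean stream utilization, static:     {sum(static) / len(static):.1%}")
print(f"mean stream utilization, continuous: {sum(cont) / len(cont):.1%}")
```

Because a continuously batched stream stays full, its working set of KV caches and activations stays roughly constant, which is exactly the flat memory curve in the second graph.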
Acceptance Rates across Generations
Note: each level represents the average n-gram acceptance rate for a given generation of a prompt. Because we train sequentially, step 0 corresponds to the first generation, while the model at step 7 has all previous generations in its training data.
We see a steady increase in acceptance rates from one generation level to the next, which shows that our sequentially trained n-gram models are effective.
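As a sketch of how this per-generation metric could be computed from logged records, here acceptance rate is the fraction of drafted tokens the target model kept. The record schema below is hypothetical, not our actual logging format.

```python
from collections import defaultdict

def acceptance_by_generation(records):
    """Average n-gram acceptance rate at each generation step.

    `records` is assumed to be an iterable of dicts like
    {"generation": 3, "accepted": 17, "drafted": 40}, where `accepted`
    counts draft tokens the target model kept and `drafted` counts all
    draft tokens proposed for that prompt at that step (hypothetical schema).
    """
    totals = defaultdict(lambda: [0, 0])  # generation -> [accepted, drafted]
    for r in records:
        totals[r["generation"]][0] += r["accepted"]
        totals[r["generation"]][1] += r["drafted"]
    return {gen: acc / drafted for gen, (acc, drafted) in sorted(totals.items())}

# Example: acceptance should rise with the generation step, since the
# n-gram model at step k was fit on generations 0..k-1.
records = [
    {"generation": 0, "accepted": 12, "drafted": 40},
    {"generation": 0, "accepted": 10, "drafted": 38},
    {"generation": 7, "accepted": 30, "drafted": 41},
]
print(acceptance_by_generation(records))
```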