
Is the new Cerebras API the fastest LLM service provider?

Let's compare five different Llama 70B providers and run some benchmarks. We'll be looking at Cerebras, Groq, Together, Fireworks, and Octo.
Created on August 26 | Last edited on August 28
Last week, I had access to the early preview of the Cerebras API service. Cerebras is known for making wafer-sized chips (many times larger than the competition) that can hold enormous amounts of on-chip memory, thus providing the speed and low latency needed to make LLMs run fast. At the time of testing, the best available model was Llama 3.1 70B so that's what we used to check how different LLM serving providers perform at serving this model.
TLDR

In this post, we'll describe the task we used for comparison and look at how Cerebras stacked up against Groq, Together, Fireworks, and Octo. We first ran a speed test (see the W&B Weave screenshot below) and then compared the providers on the 2023 NeurIPS Hacker Cup, an advanced code-based problem-solving competition. We'll go into further detail in the next section.
We saw Cerebras consistently hitting 370 tokens per second, with Groq as its closest competitor at 226. Cerebras's latency was also impressive: it finished our task in roughly 30 seconds, more than twice as fast as any other solution we tested. We also found that some providers were more reliable than others; some APIs ran into rate limits or were down at the time of testing.
Cerebras serves Llama 3.1 70B at almost 370 tokens per second on average, compared to other popular LLM providers
The tests were run with a simple inference script and averaged over 10 trials; the reported time is measured on the client side. You can check the benchmarking code here.
💡
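If you want to reproduce the measurement, the core of it looks roughly like the sketch below. It assumes an OpenAI-compatible chat endpoint; the base URL, API key, model name, and prompt are placeholders rather than the exact values from our script.

```python
# Minimal client-side throughput benchmark (a sketch, not the exact script we used).
# Assumes an OpenAI-compatible chat endpoint; base_url, api_key, model name,
# and prompt are placeholders for whichever provider you are testing.
import time
from openai import OpenAI

client = OpenAI(base_url="https://api.example-provider.com/v1", api_key="YOUR_KEY")

def run_trial(prompt: str, model: str = "llama-3.1-70b") -> float:
    """Return output tokens per second for one request, measured on the client side."""
    start = time.perf_counter()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=512,
    )
    elapsed = time.perf_counter() - start
    return response.usage.completion_tokens / elapsed

# Average over 10 trials, as in the numbers reported above.
speeds = [run_trial("Summarize the rules of the NeurIPS Hacker Cup.") for _ in range(10)]
print(f"Average output speed: {sum(speeds) / len(speeds):.1f} tokens/s")
```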

Leveraging speed to solve complex problems: NeurIPS Hacker Cup AI

The 2024 NeurIPS Hacker Cup AI is a great test bed for comparing LLMs. The problems are complicated and may require iteration to reach a working solution. It also tests the reliability and limits of the providers, since you will make multiple calls, ideally simultaneously.
Since each problem can be solved independently and there is a 6-minute time limit to complete each round of problems, you should attempt to solve them in parallel. This provides another test of the LLM service providers' capabilities.
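To make that concrete, here is a minimal sketch of firing one request per problem with asyncio. The solve_problem coroutine, base URL, and model name are hypothetical stand-ins for our actual pipeline.

```python
# Sketch: fire one request per problem concurrently to fit the 6-minute window.
# solve_problem() is a hypothetical stand-in for the real solution pipeline;
# base_url, api_key, and model name are placeholders.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="https://api.example-provider.com/v1", api_key="YOUR_KEY")

async def solve_problem(problem_statement: str) -> str:
    response = await client.chat.completions.create(
        model="llama-3.1-70b",
        messages=[{"role": "user",
                   "content": f"Write a Python solution for this problem:\n{problem_statement}"}],
    )
    return response.choices[0].message.content

async def solve_round(problems: list[str]) -> list[str]:
    # Launch all calls at once and wait for them together.
    return await asyncio.gather(*(solve_problem(p) for p in problems))

# solutions = asyncio.run(solve_round(problem_statements))
```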
The 2024 NeurIPS Hacker Cup AI starts in a few weeks, and we have been prepping for it! It features some of the most complex coding problems around, and this year has a special track for AI models to prove their worth.
We hosted a series of lectures with experts covering valuable techniques for building automated solutions for competitions like this one: Click here to watch the videos.
💡
We're going to use the 2023 NeurIPS Hacker Cup as the basis for our comparison. If you want to learn more about that competition or the Deeply Understand the Problems (DUP) strategy we implemented, you can expand any of the sections below. You can check out a sample task from that competition here.
But we figure you want to see the evaluations, so let's get to them.

How we evaluated on NeurIPS Hacker Cup

Hacker Cup evaluation results

Let's try solving the practice round from 2023, which consists of five problems. We'll use a valid solution to one of them as a few-shot example, so we'll only solve the remaining four. After generating a code solution, we'll test two things (a minimal sketch of these checks follows the list):
  • Whether the produced code runs without errors and produces an output file
  • Whether the output file matches the provided sample_output (we could check against the actual competition output, but that's too hard for this model)
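Here is the minimal sketch of those two checks: it runs a generated solution as a subprocess and diffs its output file against the sample output. The paths and timeout are illustrative assumptions, not our exact harness.

```python
# Sketch of the two checks: does the generated code run, and does its output
# match the sample output? File paths and the timeout are illustrative.
import subprocess
from pathlib import Path

def check_solution(code_path: Path, input_path: Path, sample_output_path: Path) -> tuple[bool, bool]:
    output_path = code_path.with_suffix(".out")
    try:
        with open(input_path) as fin, open(output_path, "w") as fout:
            subprocess.run(
                ["python", str(code_path)],
                stdin=fin, stdout=fout, timeout=60, check=True,
            )
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
        return False, False  # the code did not run cleanly
    matches = output_path.read_text().strip() == sample_output_path.read_text().strip()
    return True, matches
```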
Llama 3.1 70B is not particularly strong at code generation, or at reasoning in general, compared to SoTA models like Claude Sonnet or GPT-4o, or even more specialized code models like DeepSeek-Coder or Codestral, so take these results for what they are.
W&B Weave's evaluation comparison tool is a perfect fit for this. You can explore this project in our product by following this link.
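For reference, a bare-bones Weave evaluation looks roughly like the sketch below. The project name, dataset rows, and scorer are simplified placeholders for our Hacker Cup setup, and the scorer signature may differ slightly between Weave versions.

```python
# Minimal W&B Weave evaluation sketch. The project name, dataset rows, and
# scorer are simplified placeholders for the actual Hacker Cup harness.
import asyncio
import weave

weave.init("hackercup-provider-comparison")  # hypothetical project name

@weave.op()
def generate_solution(problem: str) -> str:
    # In practice, call the provider under test and return the program's output.
    return "placeholder output"

@weave.op()
def solved_scorer(expected: str, output: str) -> dict:
    # In practice, run the generated code and compare files as described above.
    return {"solved": expected.strip() == output.strip()}

dataset = [
    {"problem": "practice problem statement ...", "expected": "sample_output contents ..."},
]

evaluation = weave.Evaluation(dataset=dataset, scorers=[solved_scorer])
asyncio.run(evaluation.evaluate(generate_solution))
```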

We are not comparing model quality here, as all providers serve a flavor of Llama 3.1 70B Instruct and limit your interactions with the API in similar ways: they restrict tokens per time unit (minute/hour/day) and the number of calls. When running an agentic or iterative pipeline like this one, you will probably hit those limits, so to submit under the 6-minute rule, be sure to benchmark your pipeline thoroughly.
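One simple way to cope with those limits is to wrap each call in a retry with exponential backoff; the sketch below is a generic pattern, not any provider's official recommendation.

```python
# Generic retry-with-exponential-backoff wrapper for rate-limited calls (sketch).
import random
import time

def call_with_backoff(fn, max_retries: int = 5, base_delay: float = 1.0):
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception as err:  # in practice, catch the provider's rate-limit error
            if attempt == max_retries - 1:
                raise
            delay = base_delay * (2 ** attempt) + random.random()
            print(f"Request failed ({err}); retrying in {delay:.1f}s")
            time.sleep(delay)
```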

Let’s analyze the evaluation on two fronts: solving the actual task at hand and doing so reliably and quickly.
Most of the providers that completed the evaluation managed to produce runnable code and solve 1 out of 4 problems:
  • Cerebras: Generated runnable code for 4 out of 4 problems, with 1 out of 4 problems solved. It was considerably faster, completing the evaluation in 30 seconds. Note: The current model’s context length is 8k.
  • Groq: Produced an output for only 1 out of 4 problems; that code was runnable, but no problems were solved. Due to rate limits, we couldn’t finish the entire evaluation.
  • Fireworks: Produced runnable code for 4 out of 4 problems, with 1 out of 4 problems solved. It completed the entire evaluation in 4 minutes and 53 seconds. Due to rate limits, the evaluations were done sequentially, with each individual problem taking around 72 seconds.
  • Together: Generated runnable code for 4 out of 4 problems, solving 1 out of 4 problems.
  • Octo: Produced runnable code for 4 out of 4 problems, with 1 out of 4 problems solved. It was slightly slower than Fireworks, at around 89 seconds per problem.
As capacity rises, most of these rate and concurrency limits are expected to disappear in the coming months, particularly for business accounts.
