Cerebras Systems launches the world’s fastest AI inference
Cerebras Systems has announced the launch of Cerebras Inference, offering a 20x performance improvement over traditional GPUs at a fraction of the cost.
Cerebras Systems has announced the launch of Cerebras Inference. This new solution redefines the standards for speed, cost, and accuracy in AI inference, offering a 20x performance improvement over traditional GPUs at a fraction of the cost.
Weights & Biases is proud to be closely involved with this breakthrough, providing critical experiment tracking and evaluation tools that integrate seamlessly with Cerebras's infrastructure. We were fortunate to talk with their CEO on our latest podcast, build a native W&B Weave integration, and get a sneak peek at how their solution benchmarks against competitors.
Here’s what you should know:
Native Weave Integration with the Cerebras SDK
As a launch partner, Weights & Biases Weave supports the Cerebras SDK with a native auto-logging integration from day zero. This means that when you use Weave with the Cerebras SDK, the inputs, metadata, and outputs of every Cerebras call are automatically logged with just one line of code. See the code below to get started and the Weave documentation to learn more.

import os

import weave
from cerebras.cloud.sdk import Cerebras

# Add 1 line of weave code to turn on auto-logging
weave.init("cerebras_speedster")

# Then use the Cerebras SDK as usual
model = "llama3.1-8b"
client = Cerebras(api_key=os.environ["CEREBRAS_API_KEY"])

response = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": "What's the fastest land animal?"}],
)
print(response.choices[0].message.content)
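If you want to group several Cerebras calls under a single trace, Weave's op decorator is the usual pattern. The sketch below is our own minimal example, assuming the same environment variable and model name as above; the function name and prompt are illustrative, not part of the Cerebras SDK.

import os

import weave
from cerebras.cloud.sdk import Cerebras

weave.init("cerebras_speedster")
client = Cerebras(api_key=os.environ["CEREBRAS_API_KEY"])

# Any function decorated with weave.op is traced, so the Cerebras call
# made inside it appears nested under this op in the Weave UI.
@weave.op()
def ask(question: str) -> str:
    response = client.chat.completions.create(
        model="llama3.1-8b",
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content

print(ask("What's the fastest land animal?"))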
Gradient Dissent: Andrew Feldman discusses AI inference technology
Join Andrew Feldman, CEO of Cerebras Systems, and Weights & Biases CEO and co-founder Lukas Biewald as they discuss the latest advancements in AI inference technology on the latest episode of Gradient Dissent.
In the episode, they explore Cerebras Systems' new AI inference product, examining how their wafer-scale chips are setting new benchmarks in speed, accuracy, and cost efficiency. Andrew shares insights on the architectural innovations that make this possible and discusses the broader implications for AI workloads in production.
Comparing Cerebras to other LLM service providers
We had a chance to kick the tires on an early preview over the weekend, so of course we wanted to test how well their chips stacked up. We chose a problem-solving code task with Llama 3.1 70B Instruct and compared Cerebras against Groq, TogetherAI, Fireworks, and OctoAI. Since all providers were running Llama 3.1 70B, we didn't dig into model performance on the task itself and focused instead on latency.
The results here were impressive, and we recommend digging into the details if you're curious about trying their solution. Cerebras was indeed faster than the providers we compared it against in our first tests, and we're excited to spend more time tinkering in the coming months.
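If you want to run a rough version of this comparison yourself, here is a minimal sketch of timing a single request and estimating output tokens per second with the Cerebras SDK. The helper name, prompt, and model id are our own choices, and we assume the response exposes an OpenAI-style usage field; a real benchmark would average over many prompts and repeat the same call against each provider.

import os
import time

from cerebras.cloud.sdk import Cerebras

client = Cerebras(api_key=os.environ["CEREBRAS_API_KEY"])

def tokens_per_second(prompt: str, model: str = "llama3.1-70b") -> float:
    """Time one chat completion and return a rough output tokens/sec estimate."""
    start = time.perf_counter()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    elapsed = time.perf_counter() - start
    # Assumes an OpenAI-style usage field; adjust if your provider
    # reports token counts differently.
    return response.usage.completion_tokens / elapsed

print(tokens_per_second("Write a Python function that checks whether a string is a palindrome."))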
Exploring the impact of Cerebras Inference
Cerebras Inference represents a step forward in AI inference, offering performance that is 20 times faster than existing GPU-based solutions. According to industry benchmarks, Cerebras Inference delivers 1,700 tokens per second for Llama 3.1 8B and 450 tokens per second for Llama 3.1 70B, making it the fastest option currently available.

This new solution also addresses the critical need for cost efficiency and accuracy. Cerebras Inference maintains 16-bit precision throughout the entire inference process, ensuring state-of-the-art performance without compromising accuracy.
Cerebras Inference is powered by the Cerebras CS-3 system and the Wafer Scale Engine 3 (WSE-3), which delivers the memory bandwidth and processing power needed for today’s most demanding AI workloads. By eliminating the trade-offs typically associated with GPU-based solutions, Cerebras Inference allows developers to build and deploy AI models with increased speed and accuracy.
Join Cerebras Systems and Weights & Biases at the MLOps South Bay meetup
If you’re a Bay Area local, please join us at our MLOps South Bay Meetup on September 19 in Mountain View, CA. We’ll be joined by Daniel Kim, Head of Developer Relations at Cerebras, and Atharva Talpade, Product Manager at Cerebras, who will share insights into both the technical specs behind Cerebras Inference and its applications.
Weights & Biases Machine Learning Engineer Anish Shah will also present on fine-tuning and evaluating multimodal large language models using Weights & Biases. Come learn what’s new, get hands-on training from ML experts, and stick around to network. Space is limited, so please register early to secure your spot.