Cerebras Achieves 3x Speed Boost for Llama 3.1-70B AI Inference

The fastest LLM chip just 3x'd
Created on October 25 | Last edited on October 25
Cerebras Systems has unveiled a major update to its AI inference platform, boosting the speed of Llama 3.1-70B to 2,100 tokens per second. This breakthrough makes the platform eight times faster than top GPUs running smaller models and sixteen times faster than other optimized GPU solutions. By unlocking this new level of performance, Cerebras enables real-time AI applications across industries.

Industry Impact

Pharmaceutical leader GlaxoSmithKline (GSK) is already leveraging the upgrade. Kim Branson, GSK’s SVP of AI and ML, shared that the speed increase enables the development of intelligent research agents, accelerating drug discovery and enhancing productivity. Voice AI applications also benefit, with LiveKit integrating Cerebras to reduce latency in ChatGPT’s voice mode. LiveKit CEO Russ d’Sa noted that Cerebras has transformed inference from the slowest part of their workflow into the fastest, allowing developers to build voice AI that responds instantly and accurately.

Benchmarking And Technical Improvements

Artificial Analysis, an independent benchmarking firm, confirmed Cerebras’ industry-leading performance. The platform completes complex workflows in real time, enabling reasoning models like OpenAI’s o1 to conduct deep, multi-step reasoning without noticeable delay. The speed boost is driven by software optimizations and the introduction of speculative decoding, with no compromise to accuracy or model precision.
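To give a sense of how speculative decoding trades cheap draft passes for fewer expensive target passes, here is a minimal toy sketch. The `draft_next` and `target_next` functions below are illustrative stand-ins invented for this example, not Cerebras' actual models or implementation; real systems verify draft tokens against the target model's probability distribution, while this sketch uses simple greedy matching.

```python
def target_next(prefix):
    """Stand-in for the large 'target' model: deterministic next token."""
    return (prefix[-1] * 3 + 1) % 10

def draft_next(prefix):
    """Stand-in for the small 'draft' model: a cheap, often-right guess."""
    guess = (prefix[-1] * 3 + 1) % 10
    # The toy draft model errs on tokens >= 8, mimicking imperfection.
    return guess if guess < 8 else 0

def speculative_decode(prefix, n_tokens, k=4):
    """Generate n_tokens, verifying up to k draft tokens per target pass."""
    out = list(prefix)
    while len(out) - len(prefix) < n_tokens:
        # 1) Draft model proposes k tokens autoregressively (cheap).
        ctx, proposal = list(out), []
        for _ in range(k):
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)
        # 2) Target model verifies the whole proposal in one pass,
        #    accepting the longest prefix matching its own choices.
        for t in proposal:
            expected = target_next(out)
            if t == expected:
                out.append(t)           # draft token accepted
            else:
                out.append(expected)    # target's correction; stop here
                break
            if len(out) - len(prefix) >= n_tokens:
                break
    return out[len(prefix):]

# Output is identical to plain greedy decoding with the target model;
# when the draft agrees, several tokens land per target verification pass.
print(speculative_decode([1], 6))  # → [4, 3, 0, 1, 4, 3]
```

Because accepted tokens always match what the target model would have produced, the result is token-for-token identical to decoding with the target model alone, which is why the technique speeds up generation "with no compromise to accuracy."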


Future Outlook

Cerebras' improvements are already shaping the next generation of AI tools by making real-time reasoning and responsiveness achievable for developers. The company plans to continue refining its platform and expanding capabilities, offering developers access through chat or API at inference.cerebras.ai.
This latest upgrade demonstrates Cerebras' commitment to unlocking the full potential of its Wafer Scale Engine through software, setting new benchmarks for speed and efficiency across AI applications.
Tags: ML News