xAI launches Grok 4: The smartest LLM yet?

Created on July 10 | Last edited on July 10

xAI has launched Grok 4, its newest flagship model, claiming it achieves “PhD-level” academic performance in every subject. The release came during a livestream Wednesday night, where Elon Musk, wearing a leather jacket, sat alongside xAI executives and made bold statements about the model’s capabilities. He admitted that Grok 4 still “lacks common sense” and has yet to invent new technologies or uncover physical laws, but he framed those limitations as temporary. The company also introduced Grok 4 Heavy, described as a multi-agent version that generates multiple solutions in parallel and chooses the best answer through a group-like deliberation process.
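
xAI has not published how Grok 4 Heavy's deliberation works, so the following is only an illustrative sketch of the general idea of running several solvers in parallel and picking one answer. It swaps in simple majority voting for whatever xAI actually does, and the names `solve_in_parallel`, `agent_a`, `agent_b`, and `agent_c` are hypothetical stand-ins rather than any real xAI API.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence

# An "agent" here is any function that maps a question to an answer string.
Agent = Callable[[str], str]


def solve_in_parallel(question: str, agents: Sequence[Agent]) -> str:
    """Run every agent on the question concurrently and return the most common answer."""
    with ThreadPoolExecutor(max_workers=len(agents)) as pool:
        # pool.map preserves the order of `agents` in the results.
        answers = list(pool.map(lambda agent: agent(question), agents))
    # Majority vote (an assumption, not xAI's disclosed method).
    # Ties go to the answer that appears first in agent order.
    winner, _count = Counter(answers).most_common(1)[0]
    return winner


# Hypothetical stand-ins for independent model calls.
def agent_a(question: str) -> str:
    return "42"


def agent_b(question: str) -> str:
    return "41"


def agent_c(question: str) -> str:
    return "42"


if __name__ == "__main__":
    print(solve_in_parallel("What is 6 * 7?", [agent_a, agent_b, agent_c]))
```

In a real system each agent would be a separate model call, and the selection step could be far richer than a vote, for example agents comparing and critiquing one another's work.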

Performance Benchmarks and Multi-Agent Architecture

xAI is emphasizing Grok 4’s academic and reasoning capabilities, citing its performance on benchmarks such as Humanity’s Last Exam and ARC-AGI-2. On the former, Grok 4 without tools scored 25.4%, outperforming Google’s Gemini 2.5 Pro and the high variant of OpenAI’s o3. With tools enabled, Grok 4 Heavy scored 44.4%. On ARC-AGI-2, which measures pattern recognition and abstract reasoning, Grok 4 scored 16.2%, nearly double the score of its closest competitor, Claude Opus 4. These numbers suggest a significant leap in performance, particularly on structured academic tasks, though for now the figures come only from xAI’s own reporting.
Here are the benchmarks as reported by xAI:

GPQA

o3 (no tool): 83.3%
Gemini 2.5 Pro (no tool): 86.4%
Claude 4 Opus (no tool): 79.6%
Grok 4 (no tool): 87.5%
Grok 4 Heavy: 88.9%

AIME25

o3 (no tool): 88.9%
Gemini 2.5 Pro (no tool): 88.0%
Claude 4 Opus (no tool): 75.5%
Grok 4 (no tool): 91.7%
Grok 4: 98.4%
Grok 4 Heavy: 100.0%

LCB (Jan–May)

o3 (no tool): 72.0%
Gemini 2.5 Pro (no tool): 74.2%
Grok 4 (no tool): 79.0%
Grok 4: 79.3%
Grok 4 Heavy: 79.4%

HMMT25

o3 (no tool): 77.5%
Gemini 2.5 Pro (no tool): 82.5%
Claude 4 Opus (no tool): 58.3%
Grok 4 (no tool): 90.0%
Grok 4: 93.9%
Grok 4 Heavy: 96.7%

USAMO25

o3: 21.7%
Gemini 2.5 Pro: 34.5%
Claude 4 Opus: 49.4%
Grok 4: 37.5%
Grok 4 Heavy: 61.9%
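
To make the size of the reported gaps easier to see, here is a small sketch that computes Grok 4 Heavy's lead over the best non-xAI model on each benchmark. The scores are copied from the lists above; the helper name `heavy_lead` is our own. Note that the comparison is rough, since most competitor scores are labeled "no tool" while the Grok 4 Heavy entries carry no such label.

```python
# Scores (percent) copied from the lists above, as reported by xAI.
REPORTED = {
    "GPQA": {"o3": 83.3, "Gemini 2.5 Pro": 86.4, "Claude 4 Opus": 79.6, "Grok 4 Heavy": 88.9},
    "AIME25": {"o3": 88.9, "Gemini 2.5 Pro": 88.0, "Claude 4 Opus": 75.5, "Grok 4 Heavy": 100.0},
    "LCB (Jan–May)": {"o3": 72.0, "Gemini 2.5 Pro": 74.2, "Grok 4 Heavy": 79.4},
    "HMMT25": {"o3": 77.5, "Gemini 2.5 Pro": 82.5, "Claude 4 Opus": 58.3, "Grok 4 Heavy": 96.7},
    "USAMO25": {"o3": 21.7, "Gemini 2.5 Pro": 34.5, "Claude 4 Opus": 49.4, "Grok 4 Heavy": 61.9},
}


def heavy_lead(scores: dict[str, float], xai_model: str = "Grok 4 Heavy") -> tuple[str, float]:
    """Return the best-scoring rival and the xAI model's lead over it, in percentage points."""
    rivals = {model: score for model, score in scores.items() if model != xai_model}
    best_rival = max(rivals, key=rivals.get)
    # Round to one decimal place to match the precision of the reported scores.
    return best_rival, round(scores[xai_model] - rivals[best_rival], 1)


if __name__ == "__main__":
    for benchmark, scores in REPORTED.items():
        rival, lead = heavy_lead(scores)
        print(f"{benchmark}: Grok 4 Heavy leads {rival} by {lead} points")
```

By this measure the reported lead is narrowest on GPQA and widest on HMMT25.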

Subscription Strategy and Enterprise Push

Alongside the model launch, xAI introduced its most expensive subscription to date: SuperGrok Heavy, priced at $300 per month. The plan gives subscribers early access to Grok 4 Heavy and to upcoming products, including a coding assistant in August, a multimodal agent in September, and a video generation tool in October. xAI is also opening access to Grok 4 via API and is courting enterprise customers, even though its developer outreach efforts are less than three months old. The company plans to distribute its models through major cloud platforms, in line with the enterprise strategy used by other foundation model providers.

Challenges Ahead for Business Adoption

Despite benchmark wins, xAI faces significant reputational risks as it tries to position Grok as a credible alternative to ChatGPT, Gemini, and Claude. The recent incident with Grok’s behavior has raised questions about the safety and reliability of the system, especially for business users. Musk’s decision to downplay the controversy may further complicate xAI’s relationship with enterprise clients, many of whom prioritize content safety and responsible AI use. Whether Grok’s raw performance gains will be enough to offset those concerns remains uncertain. The coming months will likely determine if xAI can convert buzz into real market share.