Code generation and debugging with the Grok 4 API
Grok 4 tutorial: A step-by-step guide to using xAI’s Grok 4 for code generation and debugging, including OpenRouter setup and W&B Weave integration.
Created on July 10|Last edited on September 24
Grok 4, developed by xAI, is a cutting-edge AI model designed for advanced code generation and optimization. It offers a multimodal capability (processing text and images) and an expansive context window, making it a robust tool for developers working on complex projects. Grok 4’s design emphasizes deep reasoning, allowing it to tackle coding problems with a “think before responding” approach that enhances accuracy. This means it can generate code solutions and pinpoint bugs more effectively than many of its predecessors.
One of the key benefits of using Grok 4 is its ability to handle large codebases and intricate debugging tasks seamlessly. It surpasses popular models like OpenAI’s GPT and Anthropic’s Claude on several advanced benchmarks, demonstrating superior performance in reasoning and problem-solving. Unlike typical chatbots, Grok 4 is designed to assist developers by generating code, explaining logic, and resolving errors in a more streamlined manner.
Throughout this tutorial, we’ll explore how to set up Grok 4 via OpenRouter, use it for code generation, and leverage W&B Weave for evaluating outputs and monitoring the model’s behavior in real-time.
Table of contents
- What is Grok 4?
- Grok 4 compared to other AI models
- Grok 4 pricing
  - Base Rates
  - Comparison to other models
  - Cost-saving options
- Tutorial: Code generation and observability with Grok 4 and W&B Weave
  - Step 1: Setting Up Your Environment: Grok 4, OpenRouter, and W&B Weave
  - Step 2: Generate Initial Code with Grok 4
  - Step 3: Test and evaluate the generated code with Weave
  - Step 4: Debug and refine with Grok 4 (Observed with Weave)
  - Step 5: Leveraging W&B Weave for deeper analysis and workflow improvement
- Alternative use cases for Grok 4
- Conclusion
What is Grok 4?
Grok 4 is xAI’s flagship large language model, purpose-built to excel at complex tasks like programming assistance and critical reasoning. It represents a significant leap over earlier models (like Grok 3) in both scale and capability. Under the hood, Grok 4 employs a hybrid neural architecture with approximately 1.7 trillion parameters, which is orders of magnitude larger than many competing models. This massive scale, combined with specialized attention mechanisms for code and math, allows Grok 4 to understand and generate very intricate solutions. For example, it maintains a context window of up to 256,000 tokens, allowing it to consider extensive code files or documents when formulating responses. In practical terms, Grok 4 can read and reason about entire code repositories or lengthy problem descriptions without losing track of details.
In comparison to other AI models, such as GPT, Claude, or Google’s Gemini, Grok 4 stands out for its unique features and advanced capabilities. Notably, it’s a multimodal model, meaning it can handle both text and images as inputs. This opens the door to applications such as analyzing a code screenshot or a diagram and providing text-based answers. Additionally, Grok 4 introduces an advanced function-calling capability, enabling it to interface with external tools or APIs as needed.
This is similar to how one might extend ChatGPT with plugins, but Grok 4 has this tool-use mechanism built into its API, enabling it to perform actions like code execution or data retrieval as part of its reasoning process. For software developers, xAI provides a specialized variant called “Grok 4 Code”, which is tailored for integration with development environments (like the Cursor editor). Grok 4 Code goes beyond basic text completion – it can suggest code optimizations, help with debugging, and even recommend architectural improvements.
In short, Grok 4 is designed to be a developer’s AI assistant, combining the conversational capabilities of a chatbot with the technical expertise of a coding expert.
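To make the tool-use idea concrete: OpenRouter exposes function calling through the familiar OpenAI-style `tools` parameter in the chat completions request. The sketch below is purely illustrative — the `run_tests` tool, its schema, and its description are invented for this example; consult the xAI and OpenRouter documentation for the exact contract and for which models support tool calls.

```python
# Hypothetical sketch: declaring a tool that Grok 4 may choose to call via
# OpenRouter's OpenAI-compatible chat API. The "run_tests" tool is invented
# for illustration; real tool names and schemas are up to your application.
tools = [{
    "type": "function",
    "function": {
        "name": "run_tests",  # hypothetical tool name
        "description": "Run a project's unit tests and return the results.",
        "parameters": {
            "type": "object",
            "properties": {
                "path": {"type": "string",
                         "description": "Test file or directory to run."}
            },
            "required": ["path"]
        }
    }
}]

# The request payload: instead of plain text, the model may answer with a
# structured tool call that your code executes before continuing the dialogue.
payload = {
    "model": "x-ai/grok-4",
    "messages": [{"role": "user", "content": "Fix the failing tests in tests/"}],
    "tools": tools,
}
```

Your application would then execute any returned tool call and send the result back as a follow-up message, letting the model fold real execution results into its reasoning.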

Grok 4's hybrid neural architecture and attention mechanism
Grok 4 compared to other AI models
Grok 4’s performance on standard benchmarks places it at the forefront of current AI models. In a suite of evaluations, xAI reported that Grok 4 achieved top-tier results across diverse domains.
For instance:
- In the AIME (American Invitational Math Exam), Grok 4 achieved a perfect score (100%), a dramatic improvement over Grok 3’s performance and even surpassing the averages of human experts.
- It also demonstrated exceptional scientific reasoning, scoring 87% on a graduate-level physics Q&A test (GPQA), compared to 75% by its predecessor. These figures demonstrate a profound understanding of complex domains, resulting in more effective code generation in scientific or mathematical computing contexts.
- On a coding-specific benchmark (SWE-bench), Grok 4 scored around 72–75%, highlighting its strong capabilities in code generation and debugging tasks. Such scores suggest that in many programming challenges, Grok 4 can outperform other models in both accuracy and reliability.
Beyond raw benchmark numbers, Grok 4 introduces revolutionary changes in AI model design that set it apart from competitors like OpenAI’s GPT-4o, Google’s Gemini, and Anthropic’s Claude. Architecturally, Grok 4 employs a hybrid modular design, containing multiple specialized sub-modules that can operate in parallel. This means it can handle different cognitive tasks (such as understanding natural language, writing code, and performing calculations) simultaneously without bottlenecks. In practical terms, when a prompt is complex (e.g., “Read this documentation and write a program based on it”), Grok 4 can parse the language, reason logically, and produce code in tandem, rather than sequentially. This innovative design contributes to both speed and accuracy in its responses. Moreover, with its colossal parameter count (1.7T), Grok 4 has a vast knowledge reservoir, notably larger than GPT-4’s underlying model, which can lead to more informed and context-rich answers.
Grok 4 has also been positioned as a leader in the push toward Artificial General Intelligence (AGI). On the ARC-AGI-2 test – a challenging benchmark for abstract reasoning – Grok 4 scored 16.2%, roughly twice the performance of the next-best commercial model (Claude Opus 4). Similarly, in a comprehensive “Humanity’s Last Exam (HLE)” designed to test broad human-level reasoning, Grok 4 outperformed Google’s Gemini 2.5 and OpenAI’s models, especially when allowed to use tools.
Independent evaluations have corroborated xAI’s claims: one analysis gave Grok 4 an Intelligence Index of 73, higher than OpenAI’s and Google’s latest at 70. All these comparisons point to Grok 4’s competitive edge – it’s often at least on par with, or ahead of, the best that OpenAI and others have to offer. Its combination of strong reasoning, multimodal input, and developer-oriented design makes it a compelling choice for those who require more than a generic chatbot. In the fast-evolving AI landscape, Grok 4 has emerged as a model that not only matches its peers in many areas but also pushes the boundaries with novel capabilities, such as massive context handling and integrated tool use.

Performance comparison of Grok 4 with other AI models
Grok 4 pricing
Using Grok 4 through OpenRouter is billed by the number of tokens you send (input) and receive (output).
Base Rates
- Input tokens: $3.00 per million → $0.003 per 1K tokens
- Output tokens: $15.00 per million → $0.015 per 1K tokens
- Cached reads: 0.25× the normal price (for repeated prompts)
For example: If you send a 200-token prompt and Grok 4 generates a 2,000-token code output, the total cost is approximately 3¢.
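As a sanity check on these rates, a few lines of arithmetic reproduce the example above (the rates are those quoted in this section; actual billing may differ if pricing changes):

```python
# Token prices quoted above, converted to dollars per token.
INPUT_PRICE = 3.00 / 1_000_000    # $3.00 per million input tokens
OUTPUT_PRICE = 15.00 / 1_000_000  # $15.00 per million output tokens

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Rough cost estimate for one Grok 4 call at the base rates."""
    return input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE

# The example from the text: a 200-token prompt and a 2,000-token completion.
cost = estimate_cost(200, 2000)
print(f"${cost:.4f}")  # → $0.0306, i.e. about 3 cents
```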
Comparison to other models
| Model | Input (per 1K) | Output (per 1K) |
|---|---|---|
| Grok 4 | $0.003 | $0.015 |
| GPT-4 (at launch) | $0.03 | $0.06 |
| GPT-3.5 (for ref.) | ~$0.0015 | ~$0.002 |
Grok 4 is significantly cheaper than GPT-4’s original pricing, while still offering advanced capabilities, albeit at a higher cost than lightweight models like GPT-3.5.
Cost-saving options
- Prompt caching: Get repeated query results at a fraction of the cost.
- Subscription plans: Options like SuperGrok offer token bundles and discounts for frequent use.
Bottom Line
Expect around $0.015 per 1K response tokens and $0.003 per 1K prompt tokens. For most coding tasks, that translates to just a few cents per interaction, a reasonable price for state-of-the-art performance, especially when paired with Weave for tracking usage and efficiency.
Tutorial: Code generation and observability with Grok 4 and W&B Weave
This tutorial will guide you through using xAI's cutting-edge Grok 4 model for advanced code generation and debugging, while simultaneously leveraging W&B Weave for comprehensive observability. By integrating Weave, you'll gain a powerful way to track every prompt, Grok 4’s response, and your evaluation results, creating a robust and transparent AI-powered development workflow.
Step 1: Setting Up Your Environment: Grok 4, OpenRouter, and W&B Weave
Before we dive into code generation, let's get our environment ready by setting up access to Grok 4 via OpenRouter and initializing W&B Weave for logging.
- 1.1 Create OpenRouter and W&B Accounts & API Keys:
- OpenRouter: Sign up on the OpenRouter website (openrouter.ai). Once logged in, navigate to the API keys section and create a new API key. This key will authenticate your requests to the OpenRouter API. Keep it secure.
- Weights & Biases: Create a free Weights & Biases account. After signing up, navigate to https://wandb.ai/authorize to find your Weights & Biases API key. This key is essential for logging data to your W&B dashboard.
- 1.2 Enable Access to xAI’s Grok Model on OpenRouter: On the OpenRouter platform, locate xAI’s Grok 4 in the model list (labeled as x-ai/grok-4). You might need to agree to specific terms of service or ensure you have a payment method on file, as Grok 4 is a premium model.
- 1.3 Install Necessary Python Libraries: Ensure you have the requests library (for OpenRouter API calls) and weave (for W&B observability) installed. Open your terminal or command prompt and run:

```
pip install requests weave
```
- 1.4 Configure API Access and Initialize Weave in Your Script: Now, let's set up the Python script where all our interactions will happen. We'll define two key functions wrapped with weave.op(): call_grok4 to interact with the model and evaluate_prime_function to test its output. Wrapping these with @weave.op() automatically logs their inputs, outputs, and execution details to your W&B dashboard, creating detailed 'traces'.
- Replace "YOUR_OPENROUTER_API_KEY" with your actual key (or set the OPENROUTER_API_KEY environment variable).

```python
import weave
import requests
import math
import os
import re

# --- W&B Weave Initialization ---
project_name = "Grok4-CodeGen-Tutorial"
weave.init(project_name)

# --- OpenRouter API Configuration ---
# Secure way to input API key
OPENROUTER_API_KEY = os.getenv("OPENROUTER_API_KEY", "YOUR_OPENROUTER_API_KEY")  # Replace with your key if not using env var
OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"
GROK4_MODEL = "x-ai/grok-4"

# Define a Weave operation for calling Grok 4
@weave.op()
def call_grok4(prompt: str, context_messages: list = None) -> str:
    """Sends a prompt to Grok 4 via OpenRouter and returns the response."""
    if context_messages is None:
        context_messages = []
    headers = {
        "Authorization": f"Bearer {OPENROUTER_API_KEY}",
        "Content-Type": "application/json",
        "HTTP-Referer": "https://localhost:8888",  # Required for OpenRouter
        "X-Title": "Grok4-CodeGen-Tutorial"        # Optional but recommended
    }
    messages = context_messages + [{"role": "user", "content": prompt}]
    data = {
        "model": GROK4_MODEL,
        "messages": messages,
        "max_tokens": 1000,
        "temperature": 0.7
    }
    try:
        response = requests.post(OPENROUTER_URL, json=data, headers=headers)
        response.raise_for_status()
        result = response.json()
        # Check if response has the expected structure
        if 'choices' in result and len(result['choices']) > 0:
            return result['choices'][0]['message']['content']
        else:
            return f"Unexpected response format: {result}"
    except requests.exceptions.RequestException as e:
        return f"API request failed: {str(e)}"
    except KeyError as e:
        return f"Unexpected response structure: {str(e)}"

# Define a Weave operation to evaluate the generated code
@weave.op()
def evaluate_prime_function(code: str, test_cases: list) -> dict:
    """Executes the provided code and tests the is_prime function against test cases.
    Returns a dictionary of results for Weave to log."""
    results = {}
    total_tests = len(test_cases)
    passed_tests = 0
    try:
        # Create a clean namespace for code execution
        exec_globals = {"math": math}  # Provide math module
        # Execute the generated code to define the is_prime function
        exec(code, exec_globals)
        is_prime_func = exec_globals.get('is_prime')
        if not is_prime_func:
            raise ValueError("Generated code does not define 'is_prime' function.")
        # Test each case
        for num, expected in test_cases:
            try:
                actual = is_prime_func(num)
                is_correct = (actual == expected)
                if is_correct:
                    passed_tests += 1
                results[f"Test_{num}"] = {
                    "input": num,
                    "expected": expected,
                    "actual": actual,
                    "correct": is_correct
                }
            except Exception as test_error:
                results[f"Test_{num}"] = {
                    "input": num,
                    "expected": expected,
                    "actual": None,
                    "correct": False,
                    "error": str(test_error)
                }
    except Exception as e:
        results["execution_error"] = str(e)
        passed_tests = 0
    return {
        "passed_tests": passed_tests,
        "total_tests": total_tests,
        "accuracy": passed_tests / total_tests if total_tests > 0 else 0,
        "test_details": results
    }

print("Setup complete!")
```
- You've now configured your environment and defined two weave.op functions. Every time call_grok4 or evaluate_prime_function is called, Weave will automatically log their inputs, outputs, and execution details to your W&B dashboard, creating a 'trace' of the process.
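Before moving on, it can help to confirm the key is actually wired up. This optional check is a small sketch (the helper name is ours, not part of any library) that verifies the OPENROUTER_API_KEY environment variable is set without making a paid API call:

```python
import os

def check_api_key(env_var: str = "OPENROUTER_API_KEY") -> bool:
    """Return True if a plausible OpenRouter key is configured in the environment."""
    key = os.getenv(env_var, "")
    # The placeholder string from the setup snippet does not count as a real key.
    return bool(key) and key != "YOUR_OPENROUTER_API_KEY"

if __name__ == "__main__":
    if check_api_key():
        print("API key found; ready to call Grok 4.")
    else:
        print("Set OPENROUTER_API_KEY before running the tutorial.")
```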
Step 2: Generate Initial Code with Grok 4
Let's begin by asking Grok 4 to write a Python function that checks if a number is prime. We'll use our call_grok4 operation to send the prompt, ensuring this interaction is logged.
User Prompt: “Write a Python function is_prime(n) that returns True if n is a prime number and False otherwise. The function should be efficient for large n and include comments explaining the logic.”
```python
initial_prompt = """Write a Python function is_prime(n) that returns True if n is a prime
number and False otherwise. The function should be efficient for large n and include
comments explaining the logic. Only return the function, with no explanations."""

print("Sending initial prompt to Grok 4...")
# Calling our Weave-wrapped function logs this interaction
generated_code = call_grok4(initial_prompt)
print("\n--- Grok 4 Generated Code ---")
print(generated_code)
print("-----------------------------\n")
```
After running this, navigate to your W&B project (e.g., wandb.ai/your_wandb_username/Grok4-CodeGen-Tutorial). You should see a new "trace" entry under the "Traces" tab, representing this call_grok4 execution. Clicking on it will show you the exact prompt sent and the full code received, along with metadata like token usage and latency. It will look something like:

For our prime-checking function, Grok 4 generated:
```python
import math

def is_prime(n):
    """Checks if a number n is prime.

    A prime number is a natural number greater than 1 that has no positive
    divisors other than 1 and itself. This function uses an efficient trial
    division method, checking divisibility up to the square root of n, with
    additional optimizations to skip multiples of 2 and 3.
    Time complexity: O(sqrt(n)).

    :param n: An integer to check for primality.
    :return: True if n is prime, False otherwise.
    """
    # Handle small numbers: primes start from 2, so anything <= 1 is not prime.
    if n <= 1:
        return False
    # 2 and 3 are primes.
    if n <= 3:
        return True
    # Eliminate multiples of 2 and 3 early, as they can't be prime
    # (except 2 and 3 themselves, handled above).
    if n % 2 == 0 or n % 3 == 0:
        return False
    # Check factors from 5 onwards using the 6k ± 1 optimization:
    # all primes greater than 3 are of the form 6k ± 1, so we test i and i + 2
    # in steps of 6. We only need to check up to sqrt(n) because if n has a
    # factor larger than sqrt(n), it must also have a smaller one we've
    # already checked.
    i = 5
    while i * i <= n:
        if n % i == 0 or n % (i + 2) == 0:
            return False
        i += 6
    return True
```
Grok 4 often outputs nicely formatted code blocks, including comments and handling edge cases. Always review the code for basic sanity: does it follow the prompt instructions? Are there obvious mistakes?
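One practical wrinkle: even when asked for "only the function," chat models often wrap code in markdown fences, which would break the exec call inside our evaluator. A small helper (a sketch using the re module our setup script already imports; the function name is ours) strips the fences before evaluation:

```python
import re

def extract_code(response: str) -> str:
    """Return the contents of the first fenced code block, or the raw text if none."""
    # `{3} matches the triple-backtick fence; the optional word is a language tag.
    match = re.search(r"`{3}(?:\w+)?\s*\n(.*?)`{3}", response, re.DOTALL)
    return match.group(1) if match else response

# Example: a fenced chat response is reduced to bare code, ready for exec().
fence = "`" * 3
raw = fence + "python\ndef is_prime(n):\n    return n == 2\n" + fence
print(extract_code(raw))  # prints the bare function, without the fences
```

Running `generated_code = extract_code(generated_code)` before the evaluation step makes the pipeline robust to either response style.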
Step 3: Test and evaluate the generated code with Weave
Now, we'll verify if the generated code works correctly using a set of test cases. Our evaluate_prime_function operation will run the tests and automatically log the results to Weave, linking them to the code generation trace.
```python
test_cases = [
    (1, False),    # 1 is not prime
    (2, True),     # 2 is prime
    (3, True),     # 3 is prime
    (4, False),    # 4 is not prime
    (15, False),   # 15 is not prime
    (17, True),    # 17 is prime
    (97, True),    # 97 is prime
    (100, False),  # 100 is not prime
    (997, True)    # 997 is prime
]

print("Evaluating the generated code...")
# Calling our Weave-wrapped evaluation function logs the results
evaluation_results = evaluate_prime_function(generated_code, test_cases)
print("\n--- Evaluation Results ---")
print(evaluation_results)
print("--------------------------\n")
```
Go back to your Weights & Biases dashboard. The call_grok4 trace you saw earlier should now be connected to an evaluate_prime_function trace. You can view the detailed evaluation_results dictionary, which includes passed_tests, total_tests, and accuracy, directly within the trace details. This immediately tells you how well Grok's initial output performed against your defined tests.

When you run these tests, check the outputs against the expected results. Note that Grok's raw response may include markdown fences or surrounding explanation; make sure what you pass to evaluate_prime_function is plain Python, since the evaluator execs the string directly. If a bug is found or the output isn't as expected, we move to the debugging step.
Step 4: Debug and refine with Grok 4 (Observed with Weave)
One of Grok 4’s strengths is its ability to help debug code it (or someone else) wrote. Suppose our testing revealed that is_prime(1) returned True, which is incorrect. (The code Grok actually generated above handles this case correctly, so we'll simulate a buggy version for demonstration purposes.) We can feed this information back to Grok 4 to get a fix. Weave will log this next interaction as a new trace, allowing us to track the iterative refinement process.
````python
# Simulate a problematic code example for demonstration purposes.
# In a real scenario, this would be the actual 'generated_code' from Step 2 if it had a flaw.
problematic_code_example = """
def is_prime(n: int) -> bool:
    # Simulated bug: this version does not handle n < 2 correctly
    if n % 2 == 0:
        return n == 2
    import math
    limit = int(math.sqrt(n)) + 1
    for divisor in range(3, limit, 2):
        if n % divisor == 0:
            return False
    return True
"""

# Provide feedback to Grok 4 based on the evaluation findings
debugging_prompt = f"""The `is_prime` function you provided:

```python
{problematic_code_example}
```

returns True for n = 1, but 1 is not a prime number. Please fix the function so it
handles all n < 2 correctly, and return only the corrected function."""

# Calling call_grok4 again logs this new interaction
print("Sending debugging prompt to Grok 4...")
corrected_code = call_grok4(debugging_prompt)
print("\n--- Grok 4 Corrected Code ---")
print(corrected_code)
print("------------------------------\n")

# Re-evaluate the corrected code, logging the new evaluation with Weave
print("Re-evaluating the corrected code...")
corrected_evaluation_results = evaluate_prime_function(corrected_code, test_cases)
print("\n--- Corrected Evaluation Results ---")
print(corrected_evaluation_results)
print("-----------------------------------\n")
````
Observe your W&B dashboard again. You'll see another trace for this call_grok4 and evaluate_prime_function pair.
Weave allows you to easily compare the evaluation_results from the initial attempt against the corrected_evaluation_results. This side-by-side comparison demonstrates Grok's iterative improvement and the value of structured evaluation and tracing. Grok will typically not only correct the mistake but also explain what was wrong, depending on the prompt. This loop can be repeated: test the new code, and if something else comes up, ask Grok again. That makes it extremely powerful for troubleshooting edge cases or improving performance.
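The test-and-refine cycle can be expressed as a small driver loop. The sketch below is ours, not part of Weave or the xAI API; it assumes functions with the shapes of the call_grok4 and evaluate_prime_function operations from Step 1, passed in as arguments so the loop itself is testable:

```python
def refine_until_passing(prompt, call_model, evaluate, test_cases, max_rounds=3):
    """Generate code, evaluate it, and feed failures back to the model until
    all tests pass or max_rounds is exhausted. Returns (code, results).

    call_model(prompt) -> code string; evaluate(code, cases) -> results dict
    with "accuracy" and "test_details" keys, as in Step 1's evaluator.
    """
    code = call_model(prompt)
    for _ in range(max_rounds):
        results = evaluate(code, test_cases)
        if results["accuracy"] == 1.0:
            return code, results
        # Collect the failing cases to include in the feedback prompt.
        failed = [d for d in results["test_details"].values()
                  if isinstance(d, dict) and not d.get("correct", False)]
        feedback = ("The code below fails these test cases: "
                    f"{failed}.\nPlease return only a corrected version.\n\n{code}")
        code = call_model(feedback)
    return code, evaluate(code, test_cases)
```

Because call_grok4 and evaluate_prime_function are Weave ops, every round of this loop shows up as its own trace pair on the dashboard.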
Step 5: Leveraging W&B Weave for deeper analysis and workflow improvement
The power of Weave extends far beyond simple logging. By consistently using weave.op() for your interactions and evaluations with Grok 4, you unlock powerful analytical capabilities on your W&B dashboard, turning your code generation process into a measurable and optimizable workflow:
- Version Control for Prompts & Models: Every weave.op run is recorded as a trace. You can easily navigate through these traces to see how different prompts, context messages, or even versions of Grok 4 (if you were to compare models) affect the generated code and its subsequent evaluation scores. This is crucial for systematic prompt engineering and understanding the impact of your inputs.
- Cost and Performance Monitoring: Weave automatically captures metrics like token usage and latency for each `call_grok4` operation (when provided by the API). You can visualize these trends over time to manage costs effectively, identify any performance bottlenecks, and optimize your prompts for efficiency.
- Failure Analysis and Debugging: If Grok 4 ever produces "hallucinations," logically flawed code, or simply doesn't meet expectations, you have the exact prompt, response, and evaluation results recorded. This allows you to go back, understand *why* it failed, and iterate on your prompting strategy to improve future outcomes.
- Comparing Model Variants and Prompts: Weave's robust logging capabilities enable you to run experiments comparing different prompts for the same task, or even compare the performance of Grok 4 against other LLMs (e.g., `gpt-4o`, `claude-3-opus`) on specific coding benchmarks within a unified dashboard.
- Automated Dashboards and Reports: You can create custom, interactive dashboards in W&B to visualize key metrics (e.g., accuracy over time, cost per successful generation, average debugging iterations). These dashboards can be shared with your team, providing transparent insights into the efficiency of your AI-powered development.
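To feed the cost monitoring described above with data, you can read the usage block that OpenAI-compatible APIs such as OpenRouter typically include in each chat completion response. A sketch using the base rates quoted earlier (the helper is ours; adjust the rates if pricing changes):

```python
def estimate_call_cost(result: dict,
                       input_rate: float = 3.00 / 1_000_000,
                       output_rate: float = 15.00 / 1_000_000) -> float:
    """Estimate the dollar cost of one chat completion from its 'usage' block.
    Defaults are the Grok 4 base rates quoted in the pricing section."""
    usage = result.get("usage", {})
    return (usage.get("prompt_tokens", 0) * input_rate
            + usage.get("completion_tokens", 0) * output_rate)

# Example with a mock API response (field names follow the OpenAI-style schema):
mock = {"usage": {"prompt_tokens": 100, "completion_tokens": 1000}}
print(f"${estimate_call_cost(mock):.4f}")  # → $0.0153
```

Logging this number alongside each trace makes "cost per successful generation" a first-class dashboard metric.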
By integrating Weave from the outset, the entire process of generating, testing, and debugging code with Grok 4 becomes transparent, measurable, and highly efficient. You're not just getting code; you're building a reliable, observable, and continuously improving AI-powered development workflow.
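As a sketch of the model-comparison workflow mentioned above: since OpenRouter exposes many models behind one endpoint, a comparison run is just a loop over model identifiers. The helper below is illustrative (it assumes a call function that accepts a model parameter, unlike the single-model call_grok4 from Step 1, and the non-Grok model names should be verified against OpenRouter's current catalog):

```python
def compare_models(prompt, call_fn, evaluate_fn, test_cases, models):
    """Run the same prompt through several models and collect accuracy per model.

    call_fn(prompt, model) -> code string; evaluate_fn(code, cases) -> results
    dict with an "accuracy" key, as in Step 1's evaluator.
    """
    scores = {}
    for model in models:
        code = call_fn(prompt, model)
        scores[model] = evaluate_fn(code, test_cases)["accuracy"]
    return scores

# Example model identifiers (check OpenRouter's catalog for current names):
MODELS = ["x-ai/grok-4", "openai/gpt-4o", "anthropic/claude-3-opus"]
```

With call_fn and evaluate_fn wrapped as Weave ops, each model's run becomes a trace, and the resulting scores dictionary can be charted directly in a W&B dashboard.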
Alternative use cases for Grok 4

While this tutorial focuses on code generation, Grok 4’s versatile capabilities open up many other exciting applications:
- Medical imaging analysis: With its multimodal prowess, Grok 4 can analyze and interpret images alongside text. For example, it could examine an X-ray or MRI scan and provide diagnostic suggestions or detailed descriptions of the findings. This could assist doctors by triaging cases or offering a second opinion based on visual data. (Of course, any medical use would require rigorous validation, but the potential is there.)
- Creative content generation: Grok 4 isn’t limited to structured tasks – it can also generate creative text such as stories, poetry, or artwork captions. Imagine feeding it a prompt to write a short sci-fi story or a screenplay snippet; Grok’s advanced understanding allows it to maintain coherent plots and characters. Its image understanding might even allow it to write descriptions or alt-text for images, and future updates hint at image generation capabilities as well. This makes Grok a tool not just for coders, but for writers and artists looking to brainstorm or automate creative workflows.
- Educational tutoring and problem solving: Thanks to its strong reasoning skills, Grok 4 can serve as an AI tutor or educational assistant. It can break down complex concepts in math, science, or programming into step-by-step explanations. For instance, a student could ask Grok to explain a calculus problem or to help debug a piece of code they wrote. Grok’s detailed reasoning (it was trained to be methodical in problem solving) is a huge asset here. It can also generate practice problems or quiz questions on the fly. Its proficiency in mathematics and logic (recall that it can even solve advanced math problems and has proved itself in competitions) makes it a powerful tool for learning environments.
These are just a few examples. Grok 4’s ability to handle text and images, understand context over long dialogues, and perform complex reasoning means it could be applied in research analysis, legal document review, data science (e.g., writing analysis code given a dataset), and more. Whenever you have a task that involves understanding complex inputs and generating well-reasoned outputs, Grok 4 is a candidate to consider.
Conclusion
Grok 4 emerges as a powerful ally for developers and researchers, combining advanced code generation, deep reasoning, and seamless debugging into one tool. With trillions of parameters, multimodal inputs, and a vast context window, it consistently outperforms many other AI models in both benchmarks and real-world coding tasks.
When paired with W&B Weave, Grok 4 becomes not only productive but also transparent. Weave’s observability features let you track prompts, responses, accuracy, costs, and performance, ensuring that every interaction with the model is measurable and reliable.
The real value of Grok 4 lies in its versatility, as it accelerates development by automating boilerplate code while also assisting with complex algorithms, debugging, and optimization. It’s more than just a language model; it’s a developer assistant that pushes us closer to true AI-augmented programming.