Code generation and debugging with the Grok 4 API
Grok 4 tutorial: A step-by-step guide to using xAI’s Grok 4 for code generation and debugging, including OpenRouter setup and W&B Weave integration.
Grok 4, developed by xAI, is a cutting-edge AI model designed for advanced code generation and optimization. It offers multimodal capabilities (processing text and images) and an expansive context window, making it a robust tool for developers working on complex projects. Grok 4’s design emphasizes deep reasoning, allowing it to tackle coding problems with a “think before responding” approach that enhances accuracy. This means it can generate code solutions and pinpoint bugs more effectively than many of its predecessors.
One of the key benefits of using Grok 4 is its ability to handle large codebases and intricate debugging tasks seamlessly. It surpasses popular models like OpenAI’s ChatGPT and Anthropic’s Claude on several advanced benchmarks, demonstrating superior performance in reasoning and problem-solving. Unlike typical chatbots, Grok 4 is designed to assist developers by generating code, explaining logic, and resolving errors in a more streamlined manner.
Throughout this tutorial, we’ll explore how to set up Grok 4 via OpenRouter, use it for code generation, and leverage W&B Weave for evaluating outputs and monitoring the model’s behavior in real-time.
Table of contents
- What is Grok 4?
- Grok 4 compared to other AI models
- Grok 4 pricing
- Tutorial: Code generation and observability with Grok 4 and W&B Weave
- 1. Setting Up Your Environment: Grok 4, OpenRouter, and W&B Weave
- 2. Generate Initial Code with Grok 4
- 3. Test and evaluate the generated code with Weave
- 4. Debug and refine with Grok 4 (Observed with Weave)
- 5. Leveraging W&B Weave for deeper analysis and workflow improvement
- Alternative use cases for Grok 4
- Conclusion
What is Grok 4?
Grok 4 is xAI’s flagship large language model, purpose-built to excel at complex tasks like programming assistance and critical reasoning. It represents a significant leap over earlier models (like Grok 3) in both scale and capability. Under the hood, Grok 4 uses a hybrid neural architecture with about 1.7 trillion parameters, which is orders of magnitude larger than many competing models. This massive scale, combined with specialized attention mechanisms for code and math, allows Grok 4 to understand and generate very intricate solutions. For example, it maintains a context window up to 256,000 tokens, letting it consider extensive code files or documents when formulating responses. In practical terms, Grok 4 can read and reason about entire code repositories or lengthy problem descriptions without losing track of details.
In comparison to other AI models like ChatGPT, Claude, or Google’s Gemini, Grok 4 stands out for its unique features and advanced capabilities. Notably, it’s a multimodal model, meaning it can handle both text and images as inputs. This opens the door to applications such as analyzing a code screenshot or a diagram and providing text-based answers. Additionally, Grok 4 introduces an advanced function-calling capability, allowing it to interface with external tools or APIs when needed. This is similar to how one might extend ChatGPT with plugins, but Grok 4 has this tool-use mechanism built into its API, enabling it to perform actions like code execution or data retrieval as part of its reasoning process. For software developers, xAI provides a specialized variant called “Grok 4 Code”, which is tailored for integration with development environments (like the Cursor editor). Grok 4 Code goes beyond basic text completion – it can suggest code optimizations, help with debugging, and even recommend architectural improvements. In short, Grok 4 is engineered to be a developer’s AI assistant, combining the conversational prowess of a chatbot with the technical depth of a coding expert.
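To make the tool-use mechanism concrete, here is a minimal sketch of a function-calling request, assuming OpenRouter's OpenAI-compatible tools parameter. The run_tests tool and its schema are hypothetical placeholders, not part of the xAI or OpenRouter APIs:

```python
import os
import requests

# Hedged sketch: offering Grok 4 a callable tool via OpenRouter's
# OpenAI-compatible "tools" parameter. The tool below is hypothetical;
# define whatever functions fit your own workflow.
payload = {
    "model": "x-ai/grok-4",
    "messages": [{"role": "user", "content": "Run the test suite and summarize failures."}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "run_tests",  # hypothetical tool name
            "description": "Run the project's unit tests and return the results.",
            "parameters": {
                "type": "object",
                "properties": {"path": {"type": "string", "description": "Test directory"}},
                "required": ["path"],
            },
        },
    }],
}
resp = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    json=payload,
    headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
)
message = resp.json()["choices"][0]["message"]
# If the model decided to call the tool, tool_calls holds its name and JSON arguments;
# otherwise the model answers directly in content.
print(message.get("tool_calls") or message["content"])
```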

Grok 4's hybrid neural architecture and attention mechanism
Grok 4 compared to other AI models
Grok 4’s performance on standard benchmarks places it at the forefront of current AI models. In a suite of evaluations, xAI reported that Grok 4 achieved top-tier results across diverse domains. For instance, on the AIME (American Invitational Math Exam), Grok 4 achieved a perfect score (100%), a dramatic improvement over Grok 3’s performance and one that even surpasses the averages of human experts. It also demonstrated exceptional scientific reasoning, scoring 87% on a graduate-level physics Q&A test (GPQA), compared to 75% by its predecessor. These figures reflect a deep command of complex domains, which translates into more effective code generation in scientific or mathematical computing contexts. On a coding-specific benchmark (SWE-bench), Grok 4 scored around 72–75%, highlighting its strong capabilities in code generation and debugging tasks. Such scores suggest that in many programming challenges, Grok 4 can outperform other models in both accuracy and reliability.
Beyond raw benchmark numbers, Grok 4 introduces revolutionary changes in AI model design that set it apart from competitors like OpenAI’s GPT-4, Google’s Gemini, and Anthropic’s Claude. Architecturally, Grok 4 employs a hybrid modular design, containing multiple specialized sub-modules that can operate in parallel. This means it can handle different cognitive tasks (such as understanding natural language, writing code, and performing calculations) simultaneously without bottlenecks. In practical terms, when a prompt is complex (e.g., “Read this documentation and write a program based on it”), Grok 4 can parse the language, reason logically, and produce code in tandem, rather than sequentially. This innovative design contributes to both speed and accuracy in its responses. Moreover, with its colossal parameter count (1.7T), Grok 4 has a vast knowledge reservoir, notably larger than GPT-4’s underlying model, which can lead to more informed and context-rich answers.
Grok 4 has also been positioned as a leader in the push toward Artificial General Intelligence (AGI). On the ARC-AGI-2 test – a challenging benchmark for abstract reasoning – Grok 4 scored 16.2%, roughly twice the performance of the next-best commercial model (Claude Opus 4). Similarly, in a comprehensive “Humanity’s Last Exam (HLE)” designed to test broad human-level reasoning, Grok 4 outperformed Google’s Gemini 2.5 and OpenAI’s models, especially when allowed to use tools. Independent evaluations have corroborated xAI’s claims: one analysis gave Grok 4 an Intelligence Index of 73, higher than OpenAI’s and Google’s latest at 70. All these comparisons point to Grok 4’s competitive edge – it’s often at least on par with, or ahead of, the best that OpenAI and others have to offer. Its combination of strong reasoning, multimodal input, and developer-oriented design makes it a compelling choice for those who need more than a generic chatbot. In the fast-evolving AI landscape, Grok 4 has emerged as a model that not only matches its peers in many areas but also pushes the boundary with novel capabilities like massive context handling and integrated tool use.

Performance comparison of Grok 4 with other AI models
Grok 4 pricing
Using Grok 4 through OpenRouter incurs costs based on the amount of data (tokens) you send and receive, so it’s essential to understand the pricing to manage your usage effectively. OpenRouter’s pricing for Grok 4 is structured per million tokens processed. As of now, input tokens are charged at $3.00 per million, and output tokens are charged at $15.00 per million. In more granular terms, that equates to $0.000003 per input token and $0.000015 per output token. This difference in input vs. output cost is common in AI APIs – generating text is more computationally intensive (hence higher priced) than reading the prompt.
For context, these rates place Grok 4 in a premium tier of AI services. For example, OpenAI’s GPT-4 (at launch) was priced at around $0.03 per 1,000 input tokens and $0.06 per 1,000 output tokens (approximately $30 and $60 per million, respectively). By comparison, Grok 4’s token costs are lower than the original GPT-4 pricing, making it relatively cost-effective given its advanced capabilities. However, compared to some lighter models (such as older GPT-3.5 variants or smaller open-source models), Grok 4 is still more expensive, which is understandable since you’re paying for a state-of-the-art, large-scale model service.
It’s also worth noting that xAI offers prompt caching and subscription plans. For instance, if you send identical or very similar prompts repeatedly, OpenRouter can use cached results at a fraction of the cost (cached reads at 0.25 times the normal price). xAI’s own subscription tiers (like SuperGrok) provide certain token allowances or discounted rates for a fixed annual fee. Depending on your usage volume, these options might reduce costs.
In summary, expect roughly $0.015 for every 1,000 tokens of response you generate (and $0.003 per 1,000 tokens of prompt). A typical coding session where you might send a few hundred tokens of instructions and get a couple thousand tokens of code back will cost only a few cents. Still, keep an eye on usage with OpenRouter’s dashboards or by parsing the API response (which usually includes a usage field counting tokens). By understanding the pricing model, you can use Grok 4 cost-effectively, leveraging its premium features while staying within your budget. The advanced capabilities – from long context handling to high-quality code generation – often justify the cost for professional use, especially when compared to the time saved and the improvements in code quality.
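To make the arithmetic concrete, here is a small cost estimator using the rates quoted above (including the 0.25× cached-read rate mentioned earlier). The constants are a snapshot and should be updated if OpenRouter's pricing changes:

```python
# Estimate OpenRouter costs for Grok 4 at the rates quoted above.
INPUT_PER_M = 3.00         # USD per million input tokens
OUTPUT_PER_M = 15.00       # USD per million output tokens
CACHED_READ_FACTOR = 0.25  # cached prompt reads bill at a quarter of the input rate

def estimate_cost(input_tokens: int, output_tokens: int, cached_input_tokens: int = 0) -> float:
    """Return the estimated USD cost of one request."""
    cost = (input_tokens - cached_input_tokens) / 1e6 * INPUT_PER_M
    cost += cached_input_tokens / 1e6 * INPUT_PER_M * CACHED_READ_FACTOR
    cost += output_tokens / 1e6 * OUTPUT_PER_M
    return cost

# A typical coding exchange: ~300 tokens of instructions, ~2,000 tokens of code back.
print(f"${estimate_cost(300, 2000):.4f}")  # ≈ $0.0309, i.e., about three cents
```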
Tutorial: Code generation and observability with Grok 4 and W&B Weave
This tutorial will guide you through using xAI's cutting-edge Grok 4 model for advanced code generation and debugging, while simultaneously leveraging W&B Weave for comprehensive observability. By integrating Weave, you'll gain a powerful way to track every prompt, Grok 4’s response, and your evaluation results, creating a robust and transparent AI-powered development workflow.
1. Setting Up Your Environment: Grok 4, OpenRouter, and W&B Weave
Before we dive into code generation, let's get our environment ready by setting up access to Grok 4 via OpenRouter and initializing W&B Weave for logging.
- 1.1 Create OpenRouter and W&B Accounts & API Keys:
- OpenRouter: Sign up on the OpenRouter website (openrouter.ai). Once logged in, navigate to the API keys section and create a new API key. This key will authenticate your requests to the OpenRouter API. Keep it secure.
- Weights & Biases: Create a free Weights & Biases account. After signing up, navigate to https://wandb.ai/authorize to find your Weights & Biases API key. This key is essential for logging data to your W&B dashboard.
- 1.2 Enable Access to xAI’s Grok Model on OpenRouter: On the OpenRouter platform, locate xAI’s Grok 4 in the model list (labeled as x-ai/grok-4). You might need to agree to specific terms of service or ensure you have a payment method on file, as Grok 4 is a premium model.
- 1.3 Install Necessary Python Libraries: Ensure you have the requests library (for OpenRouter API calls) and weave (for W&B observability) installed. Open your terminal or command prompt and run: `pip install requests weave`
- 1.4 Configure API Access and Initialize Weave in Your Script: Now, let's set up the Python script where all our interactions will happen. We'll define two key functions wrapped with weave.op(): call_grok4 to interact with the model and evaluate_prime_function to test its output. Wrapping these with @weave.op() automatically logs their inputs, outputs, and execution details to your W&B dashboard, creating detailed 'traces'.
- Replace "YOUR_OPENROUTER_API_KEY" with your actual key.import weaveimport requestsimport mathimport osimport re# --- W&B Weave Initialization ---project_name = "Grok4-CodeGen-Tutorial"weave.init(project_name)# --- OpenRouter API Configuration ---# Secure way to input API keyOPENROUTER_API_KEY = os.getenv("OPENROUTER_API_KEY", "YOUR_OPENROUTER_API_KEY") # Replace with your key if not using env varOPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"GROK4_MODEL = "x-ai/grok-4"# Define a Weave operation for calling Grok 4@weave.op()def call_grok4(prompt: str, context_messages: list = None) -> str:"""Sends a prompt to Grok 4 via OpenRouter and returns the response."""if context_messages is None:context_messages = []headers = {"Authorization": f"Bearer {OPENROUTER_API_KEY}","Content-Type": "application/json","HTTP-Referer": "https://localhost:8888", # Required for OpenRouter"X-Title": "Grok4-CodeGen-Tutorial" # Optional but recommended}messages = context_messages + [{"role": "user", "content": prompt}]data = {"model": GROK4_MODEL,"messages": messages,"max_tokens": 1000,"temperature": 0.7}try:response = requests.post(OPENROUTER_URL, json=data, headers=headers)response.raise_for_status()result = response.json()# Check if response has the expected structureif 'choices' in result and len(result['choices']) > 0:return result['choices'][0]['message']['content']else:return f"Unexpected response format: {result}"except requests.exceptions.RequestException as e:return f"API request failed: {str(e)}"except KeyError as e:return f"Unexpected response structure: {str(e)}"# Define a Weave operation to evaluate the generated code@weave.op()def evaluate_prime_function(code: str, test_cases: list) -> dict:"""Executes the provided code and tests the is_prime function against test cases.Returns a dictionary of results for Weave to log."""results = {}total_tests = len(test_cases)passed_tests = 0try:# Create a clean namespace for code executionexec_globals = {"math": math} # Provide math module# Execute the generated code to define the is_prime functionexec(code, exec_globals)is_prime_func = exec_globals.get('is_prime')if not is_prime_func:raise ValueError("Generated code does not define 'is_prime' function.")# Test each casefor num, expected in test_cases:try:actual = is_prime_func(num)is_correct = (actual == expected)if is_correct:passed_tests += 1results[f"Test_{num}"] = {"input": num,"expected": expected,"actual": actual,"correct": is_correct}except Exception as test_error:results[f"Test_{num}"] = {"input": num,"expected": expected,"actual": None,"correct": False,"error": str(test_error)}except Exception as e:results["execution_error"] = str(e)passed_tests = 0return {"passed_tests": passed_tests,"total_tests": total_tests,"accuracy": passed_tests / total_tests if total_tests > 0 else 0,"test_details": results}print("Setup complete!")
- You've now configured your environment and defined two weave.op functions. Every time call_grok4 or evaluate_prime_function is called, Weave will automatically log their inputs, outputs, and execution details to your W&B dashboard, creating a 'trace' of the process.
2. Generate Initial Code with Grok 4
Let's begin by asking Grok 4 to write a Python function that checks if a number is prime. We'll use our call_grok4 operation to send the prompt, ensuring this interaction is logged.
User Prompt: “Write a Python function is_prime(n) that returns True if n is a prime number and False otherwise. The function should be efficient for large n and include comments explaining the logic.”
```python
initial_prompt = """Write a Python function is_prime(n) that returns True if n is a prime number and False otherwise. The function should be efficient for large n and include comments explaining the logic. Only return the function, with no explanations."""

print("Sending initial prompt to Grok 4...")
# Calling our Weave-wrapped function logs this interaction
generated_code = call_grok4(initial_prompt)

print("\n--- Grok 4 Generated Code ---")
print(generated_code)
print("-----------------------------\n")
```
After running this, navigate to your W&B project (e.g., wandb.ai/your_wandb_username/Grok4-CodeGen-Tutorial). You should see a new "trace" entry under the "Traces" tab, representing this call_grok4 execution. Clicking on it will show you the exact prompt sent and the full code received, along with metadata like token usage and latency. It will look something like:

For our prime-checking function, Grok 4 generated:
```python
import math

def is_prime(n):
    """
    Checks if a number n is prime.

    A prime number is a natural number greater than 1 that has no positive divisors
    other than 1 and itself. This function uses an efficient trial division method
    optimized for performance, checking divisibility up to the square root of n,
    with additional optimizations to skip multiples of 2 and 3.

    Time complexity: O(sqrt(n)), which is efficient for n up to around 10^18 on modern hardware.

    :param n: An integer to check for primality.
    :return: True if n is prime, False otherwise.
    """
    # Handle small numbers: primes start from 2, so anything <= 1 is not prime.
    if n <= 1:
        return False
    # 2 and 3 are primes.
    if n <= 3:
        return True
    # Eliminate multiples of 2 and 3 early, as they can't be prime (except 2 and 3 themselves).
    if n % 2 == 0 or n % 3 == 0:
        return False
    # Now check for factors from 5 onwards, using the 6k ± 1 optimization.
    # All primes greater than 3 are of the form 6k ± 1, so we check i and i + 2
    # in steps of 6, covering every 6k - 1 and 6k + 1 candidate.
    # We only need to check up to sqrt(n) because if n has a factor larger than
    # sqrt(n), it must also have a smaller one we've already checked.
    i = 5
    while i * i <= n:
        if n % i == 0 or n % (i + 2) == 0:
            return False
        i += 6
    return True
```
Grok 4 often outputs nicely formatted code blocks, including comments and handling edge cases. Always review the code for basic sanity: does it follow the prompt instructions? Are there obvious mistakes?
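As one way to do that sanity check, you can execute the returned code in a scratch namespace and spot-check a couple of values before the structured evaluation in step 3. The fence-stripping line is a defensive assumption in case Grok wraps its answer in a markdown code block:

```python
import math
import re

# Strip markdown fences if present, then execute the code in a scratch namespace.
clean = re.sub(r"```(?:python)?", "", generated_code).strip()
ns = {"math": math}
exec(clean, ns)  # defines is_prime in the scratch namespace

# Spot-check a couple of values before running the full test suite.
print(ns["is_prime"](7), ns["is_prime"](8))  # expect: True False
```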
3. Test and evaluate the generated code with Weave
Now, we'll verify if the generated code works correctly using a set of test cases. Our evaluate_prime_function operation will run the tests and automatically log the results to Weave, linking them to the code generation trace.
```python
test_cases = [
    (1, False),    # 1 is not prime
    (2, True),     # 2 is prime
    (3, True),     # 3 is prime
    (4, False),    # 4 is not prime
    (15, False),   # 15 is not prime
    (17, True),    # 17 is prime
    (97, True),    # 97 is prime
    (100, False),  # 100 is not prime
    (997, True)    # 997 is prime
]

print("Evaluating the generated code...")
# Calling our Weave-wrapped evaluation function logs the results
evaluation_results = evaluate_prime_function(generated_code, test_cases)

print("\n--- Evaluation Results ---")
print(evaluation_results)
print("--------------------------\n")
```
Go back to your Weights & Biases dashboard. The call_grok4 trace you saw earlier should now be connected to an evaluate_prime_function trace. You can view the detailed evaluation_results dictionary, which includes passed_tests, total_tests, and accuracy, directly within the trace details. This immediately tells you how well Grok's initial output performed against your defined tests.

Note that evaluate_prime_function executes the generated code directly (via exec), so you don't need to copy the is_prime function into your Python environment or a notebook by hand; simply check the outputs against the expected results. If a bug is found or the output isn't as expected, we move to the debugging step.
4. Debug and refine with Grok 4 (Observed with Weave)
One of Grok 4’s strengths is its ability to help debug code it (or someone else) wrote. Suppose our testing revealed that is_prime(1) returned True, which is incorrect (the code Grok 4 actually generated above is correct, so we'll simulate a buggy version for demonstration purposes). We can feed this information back to Grok 4 to get a fix. Weave will log this next interaction as a new trace, allowing us to track the iterative refinement process.
````python
# Simulate a problematic code example for demonstration purposes
# In a real scenario, this would be the actual 'generated_code' from step 2 if it had a flaw.
problematic_code_example = """
def is_prime(n: int) -> bool:
    # Simulating a bug: This version might not handle n < 2 correctly
    if n % 2 == 0:
        return n == 2
    import math
    limit = int(math.sqrt(n)) + 1
    for divisor in range(3, limit, 2):
        if n % divisor == 0:
            return False
    return True
"""

# Provide feedback to Grok 4 based on the evaluation findings
debugging_prompt = f"""The `is_prime` function you provided:
```python
{problematic_code_example}
```
Our tests show that is_prime(1) returns True, but 1 is not a prime number.
Please fix the function so it correctly handles n < 2 (including negative numbers),
and return only the corrected function with no explanations."""
````
```python
# Calling call_grok4 again logs this new interaction
print("Sending debugging prompt to Grok 4...")
corrected_code = call_grok4(debugging_prompt)

print("\n--- Grok 4 Corrected Code ---")
print(corrected_code)
print("------------------------------\n")
```

```python
# Re-evaluate the corrected code, logging the new evaluation with Weave
print("Re-evaluating the corrected code...")
corrected_evaluation_results = evaluate_prime_function(corrected_code, test_cases)

print("\n--- Corrected Evaluation Results ---")
print(corrected_evaluation_results)
print("-----------------------------------\n")
```
Observe your W&B dashboard again. You'll see another trace for this call_grok4 and evaluate_prime_function pair.
Weave allows you to easily compare the evaluation_results from the initial attempt against the corrected_evaluation_results. This clear comparison demonstrates Grok's iterative improvement and the value of structured evaluation and tracing. Depending on the prompt, Grok will typically not only correct the mistake but also explain what was wrong. This iterative loop can be repeated: test the new code, and if something else comes up, ask Grok again. This is extremely powerful for troubleshooting edge cases or improving performance; the sketch below shows how the cycle can be automated.
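This is one way to automate the refinement cycle using the call_grok4 and evaluate_prime_function ops defined in step 1. The retry budget and feedback template are illustrative choices, not part of the Weave or OpenRouter APIs:

```python
# Hedged sketch: refine-until-passing loop. Each call_grok4 and
# evaluate_prime_function invocation becomes its own Weave trace.
MAX_ITERATIONS = 3  # illustrative retry budget

code = generated_code
results = evaluate_prime_function(code, test_cases)
for attempt in range(MAX_ITERATIONS):
    if results["accuracy"] == 1.0:
        break  # all tests pass; stop refining
    # Collect the failing cases from the evaluation details.
    failures = {k: v for k, v in results["test_details"].items()
                if isinstance(v, dict) and not v.get("correct", False)}
    feedback = (
        f"This function failed some tests:\n{code}\n"
        f"Failing cases (input, expected, actual): {failures}\n"
        "Fix the function and return only the corrected code."
    )
    code = call_grok4(feedback)
    results = evaluate_prime_function(code, test_cases)

print(f"Final accuracy: {results['accuracy']:.0%}")
```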
5. Leveraging W&B Weave for deeper analysis and workflow improvement
The power of Weave extends far beyond simple logging. By consistently using weave.op() for your interactions and evaluations with Grok 4, you unlock powerful analytical capabilities on your W&B dashboard, turning your code generation process into a measurable and optimizable workflow:
- Version Control for Prompts & Models: Every weave.op run is recorded as a trace. You can easily navigate through these traces to see how different prompts, context messages, or even versions of Grok 4 (if you were to compare models) affect the generated code and its subsequent evaluation scores. This is crucial for systematic prompt engineering and understanding the impact of your inputs.
- Cost and Performance Monitoring: Weave automatically captures metrics like token usage and latency for each `call_grok4` operation (when provided by the API). You can visualize these trends over time to manage costs effectively, identify any performance bottlenecks, and optimize your prompts for efficiency.
- Failure Analysis and Debugging: If Grok 4 ever produces "hallucinations," logically flawed code, or simply doesn't meet expectations, you have the exact prompt, response, and evaluation results recorded. This allows you to go back, understand *why* it failed, and iterate on your prompting strategy to improve future outcomes.
- Comparing Model Variants and Prompts: Weave's robust logging capabilities enable you to run experiments comparing different prompts for the same task, or even compare the performance of Grok 4 against other LLMs (e.g., `gpt-4o`, `claude-3-opus`) on specific coding benchmarks within a unified dashboard; see the sketch after this list.
- Automated Dashboards and Reports: You can create custom, interactive dashboards in W&B to visualize key metrics (e.g., accuracy over time, cost per successful generation, average debugging iterations). These dashboards can be shared with your team, providing transparent insights into the efficiency of your AI-powered development.
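As a concrete instance of the prompt-comparison workflow mentioned above, this sketch runs two illustrative prompt variants through the same evaluation. It reuses the ops from step 1, so each run appears as its own trace in the same Weave project and can be compared side by side in the Weave UI:

```python
# Hedged sketch: A/B-testing two prompt variants for the same task.
prompt_variants = {
    "terse": "Write a Python function is_prime(n). Return only code.",
    "detailed": ("Write a well-commented Python function is_prime(n) that is "
                 "efficient for large n. Return only the function."),
}

for name, prompt in prompt_variants.items():
    code = call_grok4(prompt)                       # logged as a trace
    scores = evaluate_prime_function(code, test_cases)  # logged and linked
    print(f"{name}: {scores['passed_tests']}/{scores['total_tests']} tests passed")
```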
By integrating Weave from the outset, the entire process of generating, testing, and debugging code with Grok 4 becomes transparent, measurable, and highly efficient. You're not just getting code; you're building a reliable, observable, and continuously improving AI-powered development workflow.
Alternative use cases for Grok 4

While this tutorial focuses on code generation, Grok 4’s versatile capabilities open up many other exciting applications:
- Medical imaging analysis: With its multimodal prowess, Grok 4 can analyze and interpret images alongside text. For example, it could examine an X-ray or MRI scan and provide diagnostic suggestions or detailed descriptions of the findings. This could assist doctors by triaging cases or offering a second opinion based on visual data. (Of course, any medical use would require rigorous validation, but the potential is there; a minimal API sketch for image inputs follows this list.)
- Creative content generation: Grok 4 isn’t limited to structured tasks – it can also generate creative text such as stories, poetry, or artwork captions. Imagine feeding it a prompt to write a short sci-fi story or a screenplay snippet; Grok’s advanced understanding allows it to maintain coherent plots and characters. Its image understanding might even allow it to write descriptions or alt-text for images, and future updates hint at image generation abilities as well. This makes Grok a tool not just for coders, but for writers and artists looking to brainstorm or automate creative workflows.
- Educational tutoring and problem solving: Thanks to its strong reasoning skills, Grok 4 can serve as an AI tutor or educational assistant. It can break down complex concepts in math, science, or programming into step-by-step explanations. For instance, a student could ask Grok to explain a calculus problem or to help debug a piece of code they wrote. Grok’s detailed reasoning (it was trained to be methodical in problem solving) is a huge asset here. It can also generate practice problems or quiz questions on the fly. Its proficiency in mathematics and logic (recall that it can even solve advanced math problems and has proved itself in competitions) makes it a powerful tool for learning environments.
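For the image-based use cases above, here is a minimal sketch of sending an image alongside text, assuming OpenRouter accepts the OpenAI-style multimodal message format. The image URL is a placeholder, and (as noted) any medical application would require rigorous validation:

```python
import os
import requests

# Hedged sketch: a multimodal request mixing text and an image URL.
payload = {
    "model": "x-ai/grok-4",
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe the key findings in this scan."},
            {"type": "image_url", "image_url": {"url": "https://example.com/scan.png"}},  # placeholder
        ],
    }],
}
resp = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    json=payload,
    headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
)
print(resp.json()["choices"][0]["message"]["content"])
```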
These are just a few examples. Grok 4’s ability to handle text and images, understand context over long dialogues, and perform complex reasoning means it could be applied in research analysis, legal document review, data science (e.g., writing analysis code given a dataset), and more. Whenever you have a task that involves understanding complex inputs and generating well-reasoned outputs, Grok 4 is a candidate to consider.
Conclusion
Grok 4 stands out as a powerful new ally for developers and researchers in need of advanced code generation and problem-solving capabilities. In this article, we demonstrated how Grok 4 can be set up through OpenRouter and used to write and debug code, highlighting its strengths over more conventional AI models. With a staggering scale (trillions of parameters) and cutting-edge features like multimodal understanding and enormous context windows, Grok 4 often surpasses other AI models in both benchmarks and practical use. Its ability to not only generate code but also reason through complex tasks and integrate with developer tools makes it a uniquely valuable resource.
The benefits of using Grok 4 for code generation are clear: it can save development time by handling boilerplate or complex algorithm writing, help catch and fix bugs through iterative dialogue, and provide insights or optimizations that might not be immediately obvious. Compared to its competitors, Grok 4 offers a combination of raw performance and flexibility – it’s like having an expert programmer who also has read every software manual and math textbook available.
Moreover, by using Grok 4 in tandem with W&B Weave, teams can ensure that this powerful model is deployed with confidence. Weave’s observability and evaluation tools enable you to monitor Grok’s outputs, measure accuracy, and maintain real-time oversight of costs and performance. This means you can harness Grok 4’s capabilities while reliably tracking its impact on your project’s outcomes.
Grok 4 is more than just a new language model – it’s an advanced developer assistant that brings us a step closer to true AI-aided programming. Whether you’re generating code, analyzing data, or exploring creative ideas, Grok 4 has the potential to boost productivity and open up new possibilities. We encourage you to explore Grok 4 in your own projects, using the techniques and tools described here to get the most out of this next-generation model. Happy coding with Grok 4!