A guide to LLM debugging, tracing, and monitoring
Learn how to debug, trace, and monitor your LLM applications with W&B Weave. Gain visibility into performance, errors, and safety for reliable AI.
LLM monitoring is crucial for ensuring the performance and reliability of AI applications. Large Language Models (LLMs) can behave unpredictably, producing varied outputs for the same input due to their nondeterministic nature. Without proper observability, it becomes difficult to pinpoint why an AI application is failing or underperforming. By leveraging tools like Weights & Biases (W&B) Weave, developers can gain visibility into their LLM’s behavior and internals, which greatly enhances debugging, tracing, and monitoring processes. This leads to more robust and efficient models in production, as issues can be identified and resolved faster, improving overall reliability and user trust.
W&B Weave is a framework specifically designed for LLM-based applications, supporting every stage of development from tracking and tracing to evaluation and guardrailing. In this article, we will explore the concept of LLM observability and why it matters. We will then highlight the key features a good LLM monitoring solution should provide, and discuss how observability tools help monitor model performance, detect biases, and ensure safety. Finally, we provide a step-by-step tutorial on using W&B Weave for debugging, tracing, and monitoring a live LLM application. Whether you are new to AI or an experienced practitioner, this deep dive will equip you with knowledge and practical steps to iterate on and debug your LLM applications faster.
Understanding LLM observability
What is LLM observability and its importance?
LLM observability refers to the ability to continuously monitor, analyze, and assess the behavior and outputs of an LLM-powered application, especially when it’s running in a production environment. In simpler terms, observability provides visibility into what your LLM is doing under the hood – the inputs it receives, the outputs it generates, and the intermediate decisions it makes. This visibility is essential because LLM-based systems often involve complex chains of prompts, model calls, and tool invocations, which can be difficult to reason about without proper logging and tracing.
The importance of LLM observability stems from the unpredictable, non-deterministic nature of LLMs. The same input might yield different outputs on different runs, making bugs hard to reproduce and fix. With robust observability in place, developers can track the model’s outputs over time, detect regressions or weird behavior, and trace the sequence of events leading to any issue. In short, observability gives you insight into the “why” behind your LLM’s performance. This helps in identifying performance bottlenecks, pinpointing failure points in a chain of calls, and understanding if the model is behaving as expected.
Another crucial aspect is quality assurance. Continuous monitoring of an LLM’s outputs allows teams to evaluate the quality and consistency of responses. For example, if an AI assistant suddenly starts giving irrelevant answers or shows a drop in accuracy, observability tools can alert developers to this change. Given the high stakes of deploying AI systems (which might face real users), having this level of insight ensures reliability and builds trust in the system. In summary, LLM observability provides the eyes and ears into complex AI systems, enabling proactive detection of problems and maintenance of model reliability.
Key features of LLM observability solutions
A robust LLM observability solution should provide several key features to effectively monitor and debug AI applications:
- End-to-end tracing of LLM calls: The tool should capture each span of work – e.g. a single model call or tool invocation – and link these spans into complete traces representing a full request/response cycle. This means if your application involves multiple steps (like an agent calling multiple tools or models), all those steps are recorded as part of one trace. Having complete traces makes it possible to replay and analyze the sequence of events leading to an output, which is invaluable for debugging.
- Logging inputs and outputs (and intermediate data): All inputs fed to the LLM (prompts, user queries, etc.) and outputs generated should be logged. Additionally, any intermediate results (like an agent’s thought process or retrieved context in a RAG system) should be captured. This comprehensive logging ensures that when something goes wrong, you can inspect exactly what the model saw and produced at each step. W&B Weave, for instance, automatically tracks inputs, outputs, and even the code of functions if you instrument them, giving full visibility into each operation.
- Performance and health metrics: Observability isn’t just about the content of responses; it’s also about metrics. Key metrics include latency (response time for each call), throughput (requests handled), error rates (failed calls or exceptions), and resource usage such as token counts and costs. Good monitoring tools will track token usage and estimated API costs per call, alerting you if usage spikes unexpectedly. W&B Weave, for example, automatically calculates token usage and associated costs for each LLM call, helping you keep an eye on latency and expenses. Monitoring these metrics ensures your LLM app stays within performance and budget expectations.
- Debugging and error analysis: When an LLM call fails (throws an error, times out, or returns a malformed response), the observability tool should surface this clearly. Features like stack trace logging for exceptions, or highlighting of calls that returned incomplete data, are very helpful. The ability to dive into a particular failed span and inspect variables or error messages accelerates troubleshooting. Tools might allow tagging or commenting on problematic traces for team members to investigate.
- Bias and safety evaluation: Modern LLM deployments need to be ethical and safe. Observability solutions should aid in evaluating the content of outputs for biases, toxicity, or policy violations. This could mean integration with scorers or evaluators that rate an output for certain criteria (e.g. a toxicity score, or a bias metric). For instance, W&B Weave includes pre-built LLM-based scorers for common tasks like hallucination detection, summarization quality, and relevance. These can be used to automatically evaluate each output. By examining these scores in your monitoring dashboard, you can catch if your model’s outputs are drifting into unsafe territory or becoming biased. Some tools even support guardrails, which are active checks that can block or modify unsafe outputs before they reach the user (more on this later).
- Interpretability and metadata: It’s helpful if the observability platform can enrich traces with additional context. This might include metadata like which model version was used, which prompt template, or environment info. It could also involve storing embeddings or other representations for each call to analyze patterns. Advanced platforms offer visualization of conversation flows or allow searching through past calls by content. The goal is to make it easier to interpret why the model responded as it did. Features like W&B Weave’s custom attributes (via the weave.attributes context manager) let you tag certain calls with domain-specific metadata for easier filtering and analysis, as sketched below.
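To make that last bullet concrete, here is a minimal sketch of tagging calls with custom attributes. The project name and attribute keys are illustrative, and the model call assumes an OpenAI API key is configured:

import weave
from openai import OpenAI

weave.init("observability_demo")  # illustrative project name
client = OpenAI()

@weave.op()
def answer(question: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content

# Everything called inside this block is tagged with these attributes,
# so the calls can be filtered by them later in the Weave UI.
with weave.attributes({"env": "staging", "prompt_version": "v2"}):
    print(answer("What is LLM observability?"))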
Key features of LLM observability include comprehensive tracing, input/output logging, performance metrics tracking, automated evaluation for quality/safety, and rich interpretability tools. When evaluating solutions, ensure they cover both the technical performance (latency, errors, resource usage) and the content quality (accuracy, relevance, safety) aspects of your LLM system. A well-rounded observability platform will empower you to monitor and improve your AI application from all angles.
How LLM observability tools enhance monitoring
Monitoring model performance and health
Observability tools greatly simplify the task of monitoring an LLM’s performance and health in real time. Instead of relying on manual print statements or ad-hoc logs, these tools automatically collect and display crucial metrics about your model’s behavior. For example, an observability platform will track each call’s response time, the number of tokens consumed, and whether the call succeeded or errored out. By aggregating these metrics, you can get a live view of your model’s throughput, latency distribution, error rate, and more.
Latency and throughput: Tracking response times for each LLM invocation helps ensure your application meets its latency targets. If the average latency starts creeping up (say due to a larger prompt or an external API slowdown), a monitoring dashboard will reveal this trend quickly. Spikes in latency or sudden slowdowns can be caught and diagnosed. Likewise, throughput (requests per second) can be monitored to ensure your system is handling the expected load. Bottlenecks can be identified by looking at traces – e.g. if one particular step in a chain is consistently slow, that shows up in the trace timeline.
Token usage and cost: Many LLMs (like those provided via APIs such as OpenAI) have usage-based billing. Observability tools often integrate cost calculation so you can monitor how many tokens are used per call and estimate the cost incurred. W&B Weave automatically logs token counts from model responses and can calculate cost if it knows the model’s pricing. Keeping an eye on this is vital for live systems – if a bug causes an unexpectedly long output or an infinite loop, you might burn through tokens (and money) quickly. A good monitoring setup would alert you if a single request’s token usage or overall daily spend crosses a threshold.
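Weave records token counts automatically for supported clients, but it helps to see where the numbers come from. The sketch below reads the usage block returned by an OpenAI chat completion and estimates cost; the per-token prices are placeholder assumptions, not real pricing:

from openai import OpenAI

client = OpenAI()

# Placeholder prices per 1K tokens, for illustration only; check your provider's current pricing.
PROMPT_PRICE_PER_1K = 0.005
COMPLETION_PRICE_PER_1K = 0.015

def call_with_cost(prompt: str) -> tuple[str, float]:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    usage = resp.usage  # token counts reported by the API
    cost = (usage.prompt_tokens * PROMPT_PRICE_PER_1K
            + usage.completion_tokens * COMPLETION_PRICE_PER_1K) / 1000
    return resp.choices[0].message.content, cost

if __name__ == "__main__":
    text, cost = call_with_cost("Summarize what LLM observability means in one sentence.")
    print(f"${cost:.5f}: {text}")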
Error tracking: Monitoring tools capture exceptions or failures during LLM interactions. For instance, if a model call fails due to an invalid input or an API issue, the trace for that request will be marked as failed. By tracking the rate of errors, you can detect if something in the system has broken. Perhaps a downstream service is down or the LLM is returning more refusals than usual – these would reflect in error metrics. Observability platforms often let you filter or search traces by error type, making it easier to debug the root cause once you notice an issue.
Resource usage and system metrics: In some cases, you might also monitor GPU/CPU utilization or memory if you are hosting the model yourself. While not specific to LLM logic, these are health indicators for your deployment. Some LLMOps tools can integrate with infrastructure monitoring to correlate system metrics with LLM behavior (e.g., high CPU usage correlating with slow responses).
Crucially, observability tools like W&B Weave allow you to visualize these performance metrics over time. You might have a dashboard showing average latency per hour, a running count of tokens used today, or the success rate of responses. Continuous monitoring means you can set up alerts for anomalies. For example, you could be notified if the error rate in the last 10 minutes exceeds a certain percentage, or if latency suddenly doubles. This proactive alerting is key to maintaining a reliable AI service – you get notified of problems often before users notice them.
Observability tools enhance performance monitoring by automating the collection of vital metrics and presenting them in an actionable way. They give real-time insights into how your LLM is performing (speed, usage, error frequency) and free you from having to instrument all these metrics manually. By using such tools, you can ensure your LLM application stays healthy and promptly address any performance regressions or outages.
Evaluating bias and safety in LLM outputs
Beyond pure performance metrics, it’s equally important to monitor what the LLM is saying. LLM observability tools can help evaluate the content of model outputs for issues like bias, toxicity, or other safety concerns. This is where integrated evaluation and guardrailing features come into play.
One approach is through automated scorers or evaluators. These are functions that analyze an LLM’s output (and possibly input) and produce a score or verdict about some aspect of the output. For example, a toxicity scorer might use a pretrained classifier or heuristic to rate the output’s level of offensiveness, or a bias scorer might check if certain demographics are portrayed negatively. W&B Weave has a unified scoring system that allows developers to attach such evaluations to any LLM call. In fact, Weave comes with a collection of predefined scorers (like hallucination detection, summarization quality, embedding-based relevancy, etc.) that you can use out-of-the-box. These automated checks run alongside your normal LLM calls and log their results in the observability dashboard.
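As a rough sketch of how scorers plug into an evaluation (Weave’s predefined scorers live in weave.scorers; here we define trivial custom ones instead, and the scorer argument name, output in recent releases, may differ in older versions):

import asyncio
import weave

weave.init("scorer_demo")  # illustrative project name

# Scorers are just ops that inspect the model output and return a score dict
@weave.op()
def concise(output: str) -> dict:
    return {"within_limit": len(output.split()) <= 50}

@weave.op()
def no_boilerplate(output: str) -> dict:
    return {"clean": "as an ai language model" not in output.lower()}

class StubModel(weave.Model):
    @weave.op()
    def predict(self, question: str) -> str:
        # Stand-in for a real LLM call, kept deterministic for the sketch
        return "Weave traces LLM calls and logs their inputs, outputs, and scores."

evaluation = weave.Evaluation(
    dataset=[{"question": "What does Weave do?"}],
    scorers=[concise, no_boilerplate],
)
asyncio.run(evaluation.evaluate(StubModel()))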
How does this help? If a toxicity scorer monitors your chatbot’s responses, you can track the average toxicity score of outputs over time and be alerted if it rises above a comfortable level. Monitors in W&B Weave allow passive observation of such metrics – for instance, you might log that 5% of responses in the last hour were flagged as having a high toxicity score, prompting a deeper look. This is crucial for ensuring ethical AI behavior in production. You don’t want your model gradually drifting into giving biased or harmful outputs without noticing.
In addition to passive monitors, observability tools often support guardrails, which are active interventions for safety. A guardrail uses the same kind of scorer or check, but in a blocking or modifying capacity. For example, with Weave you could use a toxicity scorer as a guardrail that blocks any response deemed toxic (or replaces it with a sanitized message) in real time. If the output violates the defined criteria, the tool intercepts it before it reaches the user, ensuring that unsafe content is caught immediately. The beauty of Weave’s design is that guardrails double as monitors – even if a toxic response is blocked, the event is logged, and you can later review how often this happened and under what circumstances.
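The snippet below is not Weave’s built-in guardrail API; it is a minimal sketch of the pattern using plain ops, with a toy keyword check standing in for a real toxicity model. Because every op is traced, a blocked draft still shows up in the trace, which is how a guardrail doubles as a monitor:

import weave
from openai import OpenAI

weave.init("guardrail_demo")  # illustrative project name
client = OpenAI()

@weave.op()
def toxicity_score(text: str) -> float:
    # Toy placeholder; in practice call a moderation endpoint or classifier.
    flagged_terms = {"idiot", "hate"}
    return 1.0 if any(term in text.lower() for term in flagged_terms) else 0.0

@weave.op()
def generate(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

@weave.op()
def guarded_generate(prompt: str) -> str:
    draft = generate(prompt)
    if toxicity_score(draft) > 0.5:
        # The draft and this decision are both logged in the trace.
        return "Sorry, I can't help with that."
    return draft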
Another area is bias evaluation. Let’s say you want to ensure your model’s outputs are not biased against a certain group. You could implement a bias checker (for instance, scanning the output for certain negative sentiment toward demographic terms) and log those occurrences. Over time, you can gather statistics about your model’s bias tendencies. Observability platforms provide the infrastructure to do this systematically, rather than via sporadic manual testing.
Moreover, content evaluation isn’t limited to negative aspects. You might also evaluate correctness or adherence to facts (to catch hallucinations). For example, you could integrate an external knowledge-checker that flags if the model’s answer contradicts a known fact. Tools like TruLens (an open-source LLM evaluation library) provide ways to evaluate outputs on custom criteria, and such libraries can be integrated with observability platforms to log results for each output. Using these, you can monitor the rate of hallucinations or factual errors your LLM produces.
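As a rough illustration of that idea (this is not the TruLens API, just a generic LLM-as-judge check wrapped in a Weave op so each verdict is logged; model and project names are illustrative):

import weave
from openai import OpenAI

weave.init("factcheck_demo")  # illustrative project name
client = OpenAI()

@weave.op()
def fact_check(answer: str, reference: str) -> dict:
    """Ask a judge model whether the answer is supported by the reference text."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Reply with exactly SUPPORTED or UNSUPPORTED."},
            {"role": "user", "content": f"Reference:\n{reference}\n\nAnswer:\n{answer}"},
        ],
    )
    verdict = resp.choices[0].message.content.strip().upper()
    return {"supported": verdict.startswith("SUPPORTED")}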
In summary, observability tools enhance the monitoring of bias and safety by providing hooks to evaluate every model output against defined ethical and quality criteria. Through monitors, you get a passive pulse on the model’s behavior (how often it strays off course), and through guardrails you can actively enforce policies in real time. This layered approach helps maintain user trust and compliance with AI safety standards. The ability to quickly detect “anomalous” LLM behavior – whether it’s a sudden lapse in politeness or the use of disallowed language – is a major advantage of having a good LLM monitoring setup in place. It means you can correct course promptly, either by fine-tuning the model, adjusting prompts, or adding stronger safeguards as needed.
Tutorial: using W&B Weave for debugging, tracing, and monitoring
In this tutorial, we will explore how to integrate Weights & Biases Weave with different AI agents and workflows to add observability, tracing, and experiment tracking to your applications. Weave provides a convenient way to log, inspect, and debug interactions between your models and the tasks they perform, whether you are building a simple OpenAI-powered chatbot or a multi-step agent system. By instrumenting your functions and agent calls with Weave, you gain a transparent view into each step of the computation, including inputs, outputs, and intermediate reasoning.
The scripts we will walk through cover three main examples. The first shows a minimal OpenAI chat function wrapped with Weave so each request and response is automatically logged, making it easy to compare runs and inspect results later in the W&B dashboard. The second demonstrates adding Weave tracing to an agent that can search the web and answer questions, allowing you to see the sequence of tool calls and reasoning that led to the final answer. The third example builds a file-navigation agent that can explore a codebase, open relevant files, and summarize them to answer a given question, all while streaming detailed traces into Weave for review.
Together, these examples provide a foundation for building AI-powered applications with built-in transparency and reproducibility. By adding just a few lines of code, you can capture a complete record of your model interactions, making it much easier to debug issues, compare approaches, and share results with others. Here’s the code.
To start, I will show you how to integrate Weave into a basic OpenAI inference call. This is the simplest possible setup and a good way to see how Weave captures and logs your model’s inputs and outputs without changing the logic of your code. When using OpenAI and many other supported libraries, the decorator is optional because Weave can automatically trace calls through its built-in integrations. However, if you want to integrate Weave with a library that doesn’t have built-in support, or you want to trace specific functions in your own code, you can add the @weave.op decorator to capture those calls explicitly. This gives you fine-grained control over what gets logged, while still keeping the integration lightweight and easy to extend.
Before running any of the following code, make sure you install the following libraries:
pip install crewai weave openai openai-agents langchain-openai
Here’s the code using GPT-5 and Weave:
import weave
from openai import OpenAI

weave.init("weave_demo")
client = OpenAI()

@weave.op()
def gpt5_chat(system_prompt: str, user_prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-5",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    )
    return resp.choices[0].message.content.strip()

# Since Weave is integrated with OpenAI, the same call is traced even without the @weave.op decorator
def gpt5_chat_no_weave_op(system_prompt: str, user_prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-5",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    )
    return resp.choices[0].message.content.strip()

if __name__ == "__main__":
    out = gpt5_chat(
        "You are concise and factual.",
        "Explain what Weights & Biases Weave is in one sentence.",
    )
    print(out)

    out = gpt5_chat_no_weave_op(
        "You are concise and factual.",
        "Explain what Weights & Biases Weave is in one sentence.",
    )
    print(out)
With this setup, every request to GPT-5 is automatically logged in your W&B workspace, including prompts, responses, and metadata. You can review each run in the Weave UI, compare results, and debug any unexpected behavior. This same pattern works for tracing more complex pipelines or multi-step agent workflows.
Here are a few screenshots from inside Weave after we run our code. You can see the full trace of each request, including the system and user prompts, the model’s response, and the time taken for the call. Weave also tracks useful metrics like token usage and costs, so you can keep an eye on efficiency while you iterate on your prompts or experiment with different models.
Integrating Weave with OpenAI Agents
Next, I will show how to integrate Weave with an OpenAI-powered agent that can use external tools. This example builds on the basic tracing setup, but adds a layer where Weave records each step the agent takes, including tool calls, intermediate reasoning, and the final answer. By enabling tracing for the agent’s run, you get a clear view of how it arrived at its output, making it easier to debug logic, tune prompts, or verify tool usage.
Here’s the code:
import asyncio
import os
from agents import Agent, Runner, WebSearchTool
from agents import set_trace_processors
from weave.integrations.openai_agents.openai_agents import WeaveTracingProcessor
import weave

# weave tracing on
weave.init("openai_agent_websearch")
set_trace_processors([WeaveTracingProcessor()])

async def main():
    agent = Agent(
        name="Web searcher",
        instructions="Answer briefly. Use the tool if it helps.",
        tools=[WebSearchTool(user_location={"type": "approximate", "city": "New York"})],
    )
    q = "search the web for 'local sports news' and give me 1 interesting update in a sentence."
    result = await Runner.run(agent, q)
    print(result.final_output)

if __name__ == "__main__":
    asyncio.run(main())
In this script, Weave is initialized for the project and a tracing processor is added so every step in the agent’s run is logged. The agent is configured with a simple instruction set and a WebSearchTool that can retrieve information based on the user’s location. When the agent receives a query, it decides whether to call the tool, processes the results, and produces a final short answer.
In this example, Weave is integrated into the agent workflow in two main steps. First, weave.init("openai_agent_websearch") sets up a new Weave project so all traces from this script are grouped together in the dashboard. Second, we call set_trace_processors([WeaveTracingProcessor()]), which tells the agent framework to send detailed run data to Weave.
With those two lines, every action taken by the agent, including its thought process, tool calls, and outputs, is automatically captured. When the agent runs, you’ll see a complete trace in the Weave UI showing the original query, the decision to call the WebSearchTool, the raw search results, and the final generated answer. This tight integration makes it easy to monitor, debug, and improve agent behavior without changing how the agent itself is written.
Because Weave tracing is enabled, each step is visible in the Weave UI: the original question, the tool invocation, the returned search data, and the final output. This makes it straightforward to see exactly how the agent used the tool to reach its conclusion, and to identify where adjustments might improve accuracy or efficiency.

Utilizing Weave with CrewAI Agents
Finally, I will show you how to integrate Weave with a CrewAI-powered agent that can explore a codebase, open relevant files, and summarize them to answer a question. This setup demonstrates how Weave works in a multi-step, tool-using environment, where the agent needs to coordinate several actions before producing its final answer. Here's the code:
import os
from typing import Type, List, Dict, Any
from pydantic import BaseModel, Field
from crewai import Agent, Task, Crew, Process
from crewai.tools import BaseTool
from langchain_openai import ChatOpenAI
import mimetypes
import weave

weave.init("crewai_dir_agent")

ROOT = os.getenv("DIR_ROOT", ".")
MAX_LIST = int(os.getenv("DIR_MAX_LIST", "800"))
MAX_READ = int(os.getenv("DIR_MAX_READ", "5"))
MAX_BYTES = int(os.getenv("DIR_MAX_BYTES", "20000"))

def _is_text(path: str) -> bool:
    mtype, _ = mimetypes.guess_type(path)
    if mtype and mtype.startswith("text/"):
        return True
    text_exts = {".py", ".md", ".txt", ".json", ".yaml", ".yml", ".toml", ".ini", ".cfg", ".csv", ".ts", ".js", ".java", ".go", ".rs", ".rb"}
    return os.path.splitext(path)[1].lower() in text_exts

# ── Tools ──────────────────────────────────────────────────────────────────────
class RepoQueryInput(BaseModel):
    query: str = Field(..., description="Describe what you are looking for in the codebase.")
    k: int = Field(default=MAX_READ, description="How many files to return.")

class RepoExplorerTool(BaseTool):
    name: str = "Repo Explorer"
    description: str = "Given a query, return up to k relevant FILE PATHS from the current directory tree. Output one absolute path per line."
    args_schema: Type[BaseModel] = RepoQueryInput

    def _run(self, query: str, k: int = MAX_READ) -> str:
        files: List[str] = []
        for root, _, fs in os.walk(ROOT):
            for f in fs:
                p = os.path.join(root, f)
                if _is_text(p):
                    files.append(os.path.abspath(p))
                if len(files) >= MAX_LIST:
                    break
            if len(files) >= MAX_LIST:
                break
        llm = ChatOpenAI(model_name="gpt-4o-mini", temperature=0)
        prompt = f"""You are a codebase navigator.
Pick up to {k} files that best help answer the user question.
Return ONLY absolute paths, one per line. Do not add commentary.

User question:
{query}

Candidate files:
{os.linesep.join(files)}"""
        resp = llm.invoke(prompt)
        return resp.content.strip()

class ViewFileInput(BaseModel):
    filepath: str = Field(..., description="Absolute path of the file to read.")

class ViewFileTool(BaseTool):
    name: str = "View File Content"
    description: str = f"Read a text file. Truncates to {MAX_BYTES} bytes."
    args_schema: Type[BaseModel] = ViewFileInput

    def _run(self, filepath: str) -> str:
        path = filepath.strip()
        try:
            with open(path, "rb") as f:
                data = f.read(MAX_BYTES)
            return data.decode("utf-8", errors="replace")
        except Exception as e:
            return f"[read error] {e}"

class SummarizeManyInput(BaseModel):
    question: str = Field(..., description="The user question to answer from files.")
    files: str = Field(..., description="Absolute file paths separated by newlines.")

class SummarizeFilesTool(BaseTool):
    name: str = "Summarize Files For Answer"
    description: str = "Read the listed files, then answer the question using only their contents. Include filenames you used."
    args_schema: Type[BaseModel] = SummarizeManyInput

    def _run(self, question: str, files: str) -> str:
        paths = [p.strip() for p in files.splitlines() if p.strip()][:MAX_READ]
        chunks = []
        for p in paths:
            try:
                with open(p, "rb") as f:
                    data = f.read(MAX_BYTES)
                txt = data.decode("utf-8", errors="replace")
            except Exception as e:
                txt = f"[read error] {e}"
            chunks.append(f"FILE: {p}\n---\n{txt}")
        context = "\n\n".join(chunks)
        llm = ChatOpenAI(model_name="gpt-4o-mini", temperature=0.2)
        prompt = f"""Answer strictly from the files below.
If you cannot find the answer, say so.
Include the filenames you relied on.

Question:
{question}

Files:
{context}"""
        resp = llm.invoke(prompt)
        return resp.content.strip()

repo_explorer_tool = RepoExplorerTool()
view_file_tool = ViewFileTool()
summarize_files_tool = SummarizeFilesTool()

# ── Agent ──────────────────────────────────────────────────────────────────────
navigator = Agent(
    role="Directory QA",
    goal="Given a natural language question about this directory, find relevant files and answer from their contents.",
    backstory="Knows how to search the tree, open files, and synthesize answers.",
    tools=[repo_explorer_tool, view_file_tool, summarize_files_tool],
    allow_delegation=False,
    verbose=True,
    llm=ChatOpenAI(model_name="gpt-4o-mini", temperature=0.2),
)

# ── Task ───────────────────────────────────────────────────────────────────────
QUESTION = os.getenv("DIR_AGENT_QUESTION", "What does this repo do?")

task = Task(
    description=(
        f"Step 1: Use Repo Explorer with the question: {QUESTION}. "
        "Step 2: Inspect one or more of the returned paths with View File Content. "
        "Step 3: Call Summarize Files For Answer with the question and the newline list of selected paths to produce the final answer."
    ),
    expected_output="A concise answer derived from the selected files, including the filenames used.",
    agent=navigator,
)

crew = Crew(
    agents=[navigator],
    tasks=[task],
    process=Process.sequential,
    verbose=True,
)

if __name__ == "__main__":
    result = crew.kickoff()
    print("\nAnswer:\n")
    print(result)
We start by calling weave.init("crewai_dir_agent") to create a Weave project for this run. This ensures every action and output is logged together in the same workspace. The CrewAI agent is then built with three custom tools: RepoExplorerTool to search for relevant files in the directory, ViewFileTool to open and read those files, and SummarizeFilesTool to combine their contents into a concise response. These tools are registered with the agent so it can call them when needed.
When the script runs, CrewAI orchestrates the process: it uses RepoExplorerTool to pick candidate files, reads their contents with ViewFileTool, and then uses SummarizeFilesTool to answer the original question based on the gathered context. Weave captures this entire sequence, including the intermediate tool outputs, the reasoning behind tool choices, and the final answer. In the Weave UI, you can inspect each step in detail, making it easy to verify that the agent is selecting relevant files, reading them correctly, and producing accurate summaries. This combination of CrewAI for orchestration and Weave for observability gives you both flexibility in how your agents work and deep insight into how they arrive at their results.

Alternative use cases for W&B Weave
W&B Weave is a versatile platform, and its usefulness extends beyond just basic monitoring of single function calls. Here are some alternative use cases and advanced capabilities of Weave that can help in different AI development scenarios:
- Multi-turn conversation tracking: If your application involves multi-turn dialogues (for example, a chat between a user and an AI assistant), Weave’s Threads feature lets you group related calls into a conversation thread. You can assign a thread_id to a series of calls representing one session. In the Weave UI, you’ll then see a threaded view showing the conversation flow turn by turn. This is extremely useful for analyzing how an agent’s responses evolve over a conversation and where it might go off track. Instead of viewing each call in isolation, you can see the whole session’s context.
- Comparing prompts and model versions: Weave makes it easy to experiment with changes and compare outcomes. For instance, you can run your application with two different prompt versions or even two different LLM models, each logging to Weave (perhaps under separate runs or projects). The Weave dashboard allows side-by-side comparison of traces. This helps answer questions like “Did the new prompt reduce the hallucination rate?” or “Is model B faster than model A in this task?” You can literally click between runs to see differences in outputs or metrics. Weave also supports version control for prompts and models in a lightweight way, so you can track which version of a prompt was used in each run.
- Custom dashboards and monitoring: Beyond the default trace view, W&B Weave is part of the W&B ecosystem, which means you can leverage W&B’s dashboarding features. You can log custom metrics (like any counters or performance measures your code computes) and visualize them. For example, if you use a custom scorer that rates the correctness of each answer on a scale of 1-10, you could plot this score over time to ensure it’s improving. Weave’s integration with W&B also means you can set up alerts through W&B’s alerting system (like an email or Slack notification if a metric goes out of bounds). At the time of writing, fine-grained alerting for Weave’s own monitors is still a developing feature, but you can always export Weave data and use W&B Core for robust alerting; a minimal example of logging a custom metric this way appears after this list.
- Integration with existing MLops stacks: Weave plays nicely with other tools. If your organization uses MLflow or other experiment trackers, you might still use Weave specifically for the LLM tracing part and link it back to your other system via IDs or metadata. Weave also supports exporting traces (for example, via OpenTelemetry format) if needed, although it primarily shines as a self-contained solution. Additionally, because Weave has both Python and TypeScript SDKs, you can use it in various environments – be it backend services or web applications that call LLMs.
- Guardrails and feedback collection: We touched on guardrails in the context of safety. Weave also helps in collecting user feedback or ratings on LLM outputs (if your application allows users to thumbs-up/down an answer or provide corrections). By logging this feedback alongside the trace (e.g., as a scorer or as metadata), you gather a valuable dataset for improving the model. Over time, you could even fine-tune your LLM on examples where it performed poorly vs. where it succeeded, using the Weave-collected data as your evaluation set.
- Enterprise and self-hosting: For companies with strict data policies, W&B offers Weave in a self-managed deployment as well. So if needed, you can run the Weave backend on your own infrastructure to keep all prompts and outputs in-house. This addresses concerns around data privacy while still benefiting from the powerful debugging and monitoring capabilities.
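Here is the minimal example promised under the custom dashboards bullet: logging a custom quality metric with the core W&B SDK so it can be charted alongside your Weave traces. The project and metric names are illustrative, and the scores would normally come from your own scorer rather than a hard-coded list:

import wandb

run = wandb.init(project="llm-monitoring-demo")  # illustrative project name

# Pretend these are correctness scores (1-10) produced by a custom scorer
for step, answer_quality in enumerate([7, 8, 6, 9]):
    wandb.log({"answer_quality": answer_quality}, step=step)

run.finish()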
In essence, W&B Weave adapts to many LLM workflow needs – from development debugging to continuous monitoring in production. Whether you are trying to fix an agent that keeps picking the wrong tool, monitoring a live chatbot for inappropriate content, or evaluating different prompt strategies, Weave provides a unified framework to instrument and observe your application. Its strength lies in combining multiple aspects (traces, metrics, evaluations, guardrails) under one roof, so you don’t have to stitch together disparate tools.
Conclusion
Recap and future directions
In this deep dive, we discussed why monitoring and observability are vital for modern LLM applications and how W&B Weave can substantially improve the debugging and tracing process. To recap, LLM observability gives you visibility into the black-box nature of large language models by logging their every move – from input to output, along with intermediate steps and performance metrics. We highlighted key features to look for in observability tools, such as end-to-end tracing, performance monitoring (latency, token usage, cost), and built-in evaluators for output quality and safety. These features not only help in finding and fixing bugs but also ensure your AI system behaves reliably and ethically over time.
W&B Weave emerged as a powerful solution in this space, enabling developers to iterate faster on LLM-powered applications. By adding minimal instrumentation to your code (like the @weave.op decorator), you gain a rich dashboard of information: you can trace through complex agent decisions, compare different runs, monitor latency and usage, and even set up guardrails to catch unsafe outputs. We saw through the tutorial how a tool like Weave moves us beyond simplistic print debugging to a far more insightful process where every run is recorded and can be examined retrospectively. This not only saves time during development but is also invaluable for monitoring live systems – if an issue occurs in production, you have the trace data to diagnose it post-mortem.
Looking forward, the landscape of LLM monitoring and debugging tools is evolving rapidly. We’re seeing a trend towards more integrated LLMOps platforms that combine tracing, evaluation, and even continuous improvement loops. For example, some tools focus on open telemetry standards to interoperate with existing APM (Application Performance Monitoring) systems, while others emphasize seamless LangChain integration or advanced alerting systems. In 2025 and beyond, we can expect observability solutions to become more robust with features like real-time anomaly detection (e.g., automatic alerts when an LLM’s behavior deviates significantly) and more domain-specific evaluators (for different industries or use cases). W&B Weave itself is actively evolving – features like improved alerting, more integrations, and enhanced UI for comparing experiments are likely on the horizon as the community provides feedback.
In conclusion, investing in LLM observability is no longer optional if you aim to build reliable AI applications. Tools like W&B Weave empower you to debug smarter, iterate faster, and sleep more soundly knowing that you have eyes on your model even after deployment. By capturing rich traces and metrics, you turn the challenge of understanding AI behavior into a manageable, even enjoyable, task of analysis and improvement. As the field of LLMOps grows, embracing such comprehensive monitoring practices will be crucial for anyone looking to push the state-of-the-art in large language model applications.
Frequently asked questions (FAQ)
Help! My program is broken. What should I use to debug my AI app?
If your AI-powered program is not working as expected, you should use an LLM observability or tracing tool to debug it. Traditional debuggers or print statements often aren’t enough for AI apps. A tool like W&B Weave can be very helpful – it will log all the inputs and outputs of your model, so you can see exactly where things are going wrong. By inspecting the trace of each function and model call, you’ll be able to identify whether the issue is with the prompt, the model’s response, or some logic in your code. In short, use an LLMOps debugging toolkit (such as Weave) to get visibility into your AI app’s behavior and diagnose the problem.
My AI agent keeps using the wrong tool – how can I fix it?
When an AI agent consistently picks an incorrect tool in a chain-of-thought or toolkit scenario, the best approach is to trace its decision process. Using a tracing tool (for example, enabling W&B Weave’s integration with your agent framework) will let you see each step the agent takes: which tool it decided to use, what the observation was, and how it arrived at that decision. By examining this trace, you might discover that the agent’s prompt or instructions are ambiguous, causing the confusion. To fix it, you can then refine the agent’s prompt (give clearer instructions on when to use which tool) or adjust the tool descriptions. Essentially, observability exposes the agent’s “thoughts”, so you can tweak its strategy. After changes, run the agent again with tracing on – you’ll quickly see if it now chooses the correct tool.
How can I debug my agentic application beyond print statements?
Agentic applications (like those using LangChain agents or custom reasoning loops) have many moving parts, which makes them hard to debug with just print statements. You should leverage specialized agent tracing and monitoring tools. For example, W&B Weave can automatically capture each action an agent takes (each tool call, model query, etc.) and present it in a structured timeline. This way, instead of manually printing intermediate states (which can be overwhelming and still miss context), you get an interactive trace. The trace will show you the sequence of actions, the intermediate observations, and the final result. By stepping through this trace, you can spot exactly where the agent’s reasoning went astray. Additionally, you can log custom metadata or use debug callbacks provided by agent frameworks. In summary, use an LLM tracing tool or the debugging callbacks of your agent framework to go beyond print debugging.
What are some LLMOps tools that can help me iterate and debug my LLM applications faster?
There are several tools in the LLMOps ecosystem designed to speed up iteration and debugging of LLM apps. Experiment tracking platforms (like W&B itself) help log different runs of your model with various parameters, so you can compare outcomes. Observability and tracing tools (such as W&B Weave, LangSmith by LangChain, or open-source frameworks like Langfuse) focus on capturing detailed traces of LLM calls and agent steps. These let you quickly pinpoint issues in complex chains. Additionally, evaluation toolkits (like TruLens or LlamaEval) can automatically grade your model outputs, which speeds up finding quality regressions. By combining experiment tracking, tracing, and evaluation, you create a tight feedback loop: try a change, see the logged trace and metrics, identify issues, and repeat. In practice, many developers find that using a tool like Weave significantly cuts down the time to debug prompts or tune agent behavior, as compared to blind trial-and-error.
What tools should I use to debug and know my LLM app is working?
To ensure your LLM application is working correctly, you should use tools that provide both visibility and verification. For visibility into what the app is doing, use an observability tool (for instance, Weave) to log all LLM interactions. This will show you if each part of your app is executing as intended. For verification of output quality, use evaluation metrics or tests – for example, define some test cases (inputs with expected outputs) and run them through your app, possibly using an evaluation toolkit to compare results. In combination, these tools let you debug issues and confirm the app’s performance. Specifically, W&B Weave can be used during development to debug, and then in production to monitor. It will give you confidence that the app is working because you can see each prediction or answer it generates, along with metrics like latency and errors. If something goes wrong (like a spike in errors or odd outputs), the tool will make it apparent.
Which tools are best for monitoring AI agents?
Monitoring AI agents requires capturing their complex decision-making process. The best tools for this job are those that were built with LLM agent workflows in mind. W&B Weave is one such tool – it has integrations for popular agent frameworks (OpenAI’s function calling agents, LangChain, etc.) and will log each step an agent takes. Another category of tools includes platform-specific solutions like LangSmith (tailored for LangChain-based apps) which record chain and agent traces. When evaluating the “best” tool, consider ease of integration (does it require minimal code changes?), depth of insight (does it log tool inputs/outputs, model thoughts, etc.?), and real-time capabilities (can it alert you if something goes wrong?). Many developers choose Weave for its comprehensive dashboard and the ability to monitor agents in real time, seeing how frequently each tool is used and if any failures occur. In any case, the best tool is one that fits seamlessly into your development workflow and gives you a clear window into the agent’s operation.
What tools help trace and debug LLM agents?
Tracing and debugging LLM agents can be achieved with LLM observability tools or specialized agent debugging libraries. As mentioned, W&B Weave provides robust tracing for agents: it will show you a trace consisting of the agent’s thought (if available), the action (tool used or query made), and the observation received, repeated for each step. This is essentially what you need to debug an agent. Other tools that help include LangChain’s built-in tracing (LangChain can log to their LangSmith service or to standard output), and OpenAI’s debugging functions (like observing function call requests and responses). However, those may require more manual effort to piece together. Weave’s advantage is aggregating all that info in one timeline. Some open-source alternatives like Langfuse also allow self-hosted tracing of LLM apps. In summary, to trace and debug an LLM agent, use an observability platform or tracing library that records the sequence of actions – this will let you diagnose logic errors or miscommunications between the agent and its tools.
What metrics should I capture for live LLM monitoring?
For live LLM monitoring (in a production scenario), you should capture a mix of performance and quality metrics:
- Latency per request: How long each model call or agent action takes. This helps track response times.
- Throughput or QPS: How many requests are being handled per minute/hour. Useful for capacity planning and spotting traffic spikes.
- Error rate: The frequency of errors/exceptions or the rate of failed requests. This could include timeouts from the model API or your own application errors.
- Token usage: How many tokens each request uses (prompt + completion) and cumulative tokens over time. This correlates with cost if using a paid API.
- Cost: If applicable, an estimate of the API cost or computational cost per request.
- Quality metrics: Any domain-specific metric of output quality. For example, accuracy if you have a way to measure it, or user rating if your users provide feedback. Even proxy metrics like length of response or number of facts can be tracked.
- Safety metrics: Counts of how often the model produces flagged content (e.g., how many outputs were caught by a toxicity filter or needed to be blocked).
- Drift metrics: If you have a reference distribution (like average response length or sentiment), track these over time to see if the model’s behavior drifts.
Using W&B Weave or similar, a lot of these metrics are automatically logged (latency, token usage, errors). You can set up additional logging for any custom metrics relevant to your application. The key is to capture data that lets you know both the health of the service (speed, errors, usage) and the quality of the AI’s responses (correctness, safety, user satisfaction). Monitoring both types will give you a complete picture of your LLM app’s live performance.
How can I detect and alert on anomalous LLM behavior in real time?
Detecting anomalous LLM behavior in real time involves two parts: identifying what constitutes “anomaly” for your application, and having an alerting mechanism. First, define thresholds or conditions that are concerning. For example, an anomaly could be if latency suddenly doubles, or if the model starts outputting a certain unwanted phrase frequently, or if the success rate drops below 90%. Next, ensure your monitoring tool is tracking the relevant signals (latency, content flags, success rate, etc.). Many observability platforms support basic alerting rules – you can set a rule like “if more than 5% of responses in the last 5 minutes were flagged as unsafe, trigger an alert”. When such a condition is met, the tool can send notifications via email, Slack, PagerDuty, etc., depending on integration. If your chosen platform (e.g., Weave) doesn’t yet support a particular alert, you can often export the data to another system or use a script with the tool’s API to periodically check and send alerts. The bottom line is to automate the watching of these metrics: you shouldn’t rely on manually spotting anomalies. For instance, with W&B, you could use W&B Monitors or the core W&B alerting to email you if a run’s metric crosses a threshold. Real-time anomaly detection might also involve more advanced techniques like statistical anomaly detection on output embeddings or language, but start with simple threshold-based alerts for known failure modes. With those in place, you’ll get immediate warnings if your LLM behaves out of the norm, allowing you to intervene quickly.
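A minimal sketch of the “script plus webhook” approach described above, assuming you can compute flagged and total response counts for a recent window from your logs (the Slack webhook URL is a placeholder):

import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

def alert_if_flag_rate_high(flagged: int, total: int, threshold: float = 0.05) -> None:
    """Post a Slack message if the share of flagged responses exceeds the threshold."""
    if total == 0:
        return
    rate = flagged / total
    if rate > threshold:
        requests.post(
            SLACK_WEBHOOK_URL,
            json={"text": f"LLM alert: {rate:.1%} of the last {total} responses were flagged as unsafe."},
            timeout=10,
        )

# Example: 8 flagged out of 120 responses in the last 5 minutes
alert_if_flag_rate_high(flagged=8, total=120)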
What solutions are available for tracking and monitoring large language models during deployment?
There are several solutions available, ranging from fully managed platforms to open-source libraries:
- W&B Weave: A comprehensive solution (as discussed) that tracks LLM calls, agent steps, and also provides evaluation and guardrails. It’s cloud-based with on-prem options and integrates with many LLM frameworks.
- LangSmith: Offered by the LangChain team, it’s tailored for applications built with LangChain, logging chains and agents and providing an interface to monitor them.
- OpenTelemetry-based setups: Some teams integrate LLM monitoring into their existing telemetry by instrumenting code to emit spans and traces to systems like Jaeger or Datadog. This can be done via OpenTelemetry SDKs. It’s more DIY, but it’s an option if you need everything in one place.
- Other LLMOps platforms: There are startups and tools (such as Langfuse, Future AGI’s TraceAI, etc.) which focus on LLM monitoring. These often provide similar features like trace visualization, cost tracking, and alerting, each with their own twist (e.g., FutureAGI emphasizes open-source and OTel compatibility).
- Custom logging + dashboards: In the absence of a dedicated tool, some developers roll out their own lightweight monitoring – for example, logging each request’s info to a database or logging service and building a simple dashboard. This is less feature-rich but can be a quick solution.
During deployment, ideally you want a solution that requires minimal changes to your model serving code, and that can scale with your load. Many find that adopting a purpose-built LLM monitoring tool is worth it for the advanced insight it provides. Among the options, W&B Weave is popular for its ease of use and integration with the development workflow (experiment tracking and model registry integration), while others might choose a different tool if it aligns better with their stack. It’s also not uncommon to use a combination (for instance, use Weave for deep traces and an APM tool for system metrics). The good news is that the ecosystem is maturing, so you have plenty of choices to ensure your LLM is well-tracked during deployment.