
Evaluating your MCP and A2A agents with W&B Weave

As AI-powered agents become increasingly sophisticated and interconnected, developers face the challenge of ensuring seamless collaboration, observability, and integration across diverse platforms and vendors. Two major protocols—Anthropic’s Model Context Protocol (MCP) and Google’s Agent2Agent (A2A)—are paving the way for standardized, interoperable agent communication and context sharing. Yet, building and monitoring robust agentic systems remains a complex task.
W&B Weave dramatically simplifies these challenges. With integrated tracing and observability for MCP (and A2A coming soon!), Weave delivers out-of-the-box visibility into agent systems, enabling you to evaluate, monitor, debug, and iterate on your agentic AI applications.
In this article, we'll give a high-level overview of MCP and A2A, discuss the challenges of building agentic systems, and show how Weave brings structure and insight to agent development.


What is the Model Context Protocol (MCP)?

The Model Context Protocol is an open standard developed by Anthropic to unify how LLM applications interact with external tools, data, and prompts. MCP acts like a USB port for AI agents: a universal interface that enables secure access to real-time data sources, APIs, and external actions - without the need for custom integrations for every service.
MCP organizes context around three core primitives:
  • Tools: External functions that the model can execute (e.g., running code, sending emails, querying APIs).
  • Resources: Structured data that the model can reference (e.g., documents, database records, logs).
  • Prompts: Predefined templates that structure and guide LLM behavior for consistency and efficiency.
By standardizing how these resources are described and accessed, MCP enables LLM-powered applications to share a common infrastructure, reducing complexity and redundancy. In effect, any MCP-compliant client can leverage any MCP server’s capabilities, improving interoperability, security, and scalability as agentic AI moves from isolated islands to connected ecosystems.
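To make these primitives concrete, here's a minimal sketch of how they map onto the Python MCP SDK's FastMCP helper. The tool, resource, and prompt below are hypothetical placeholders, not part of any real server:

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("example_server")

@mcp.tool()
def convert_currency(amount: float, from_code: str, to_code: str) -> str:
    """Tool: an executable function the model can invoke (placeholder logic)."""
    return f"Pretend conversion of {amount} {from_code} to {to_code}"

@mcp.resource("docs://readme")
def readme() -> str:
    """Resource: structured data the model can reference."""
    return "Project readme contents go here."

@mcp.prompt()
def summarize(text: str) -> str:
    """Prompt: a reusable template that shapes the LLM's behavior."""
    return f"Summarize the following text in three bullet points:\n\n{text}"

if __name__ == "__main__":
    mcp.run(transport="stdio")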

What is Agent2Agent and how does it work with MCP?

Agent2Agent (A2A) is an open standard introduced by Google to tackle a complementary challenge: enabling autonomous AI agents - often built on different platforms and by different providers - to discover, communicate, and collaborate effortlessly. While MCP is focused on providing context access and structured tool use for LLMs, A2A focuses on the orchestration of multi-agent workflows, secure communication, and task delegation between agents.

What's the difference between MCP and A2A?

Anthropic’s Model Context Protocol (MCP) and Google’s Agent2Agent (A2A) are both open standards designed to unlock interoperability for AI agents, but they address different challenges and operate at different levels of the agent stack.
MCP focuses on connecting an individual agent (often a language model-powered assistant) to external tools, structured data, and predefined prompts. It provides a standardized framework for accessing and executing resources such as APIs, files, and functions. With MCP, AI applications can easily leverage existing infrastructure for obtaining real-time information or performing external actions, all through a universal protocol rather than a patchwork of custom connectors.
A2A, in contrast, is about enabling communication and collaboration between autonomous agents, potentially built by different teams or running on separate platforms. One of A2A’s key innovations is agent discoverability via Agent Cards: every agent exposes a machine-readable “Agent Card” that advertises its skills, supported modalities, authentication requirements, and connection details. Other agents can query and understand these Agent Cards to dynamically find the right partner for a given job, negotiate communication parameters, and coordinate delegation of work. This creates a shared language and a set of protocols that let agents seamlessly discover, interact, and cooperate across organizational or vendor boundaries.
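To make the idea concrete, here's a hypothetical Agent Card sketched as a Python dict, roughly what an agent might serve at /.well-known/agent.json. The field names below are illustrative rather than normative, so check the A2A spec for the exact schema:

agent_card = {
    "name": "Currency Conversion Agent",
    "description": "Converts amounts between currencies using live exchange rates.",
    "url": "http://localhost:1001/",
    "version": "1.0.0",
    "capabilities": {"streaming": False},
    "defaultInputModes": ["text/plain"],
    "defaultOutputModes": ["text/plain"],
    "skills": [
        {
            "id": "convert_currency",
            "name": "Convert currency",
            "description": "Convert an amount from one currency to another.",
        }
    ],
}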
In summary, MCP serves as a universal “adapter” for connecting an agent to tools and data sources, while A2A provides the protocols and self-description mechanisms (through Agent Cards) that let agents securely find and work with each other. Used together, they empower rich, context-aware agentic applications that are both extensible individually and interoperable as teams.

The MCP integration with W&B Weave

W&B Weave provides built-in support for the MCP protocol, making it easy to evaluate, monitor, and iterate on agentic applications powered by Anthropic’s standard. Out-of-the-box tracing for MCP servers and clients enables you to automatically capture end-to-end traces from agents built using MCP, without requiring any custom instrumentation.
Usually, developers would need to manually wrap individual functions or endpoints with the @weave.op decorator to capture detailed traces and telemetry within Weave. For example, if you were running an MCP-based server in Python, you'd typically import Weave, initialize it, and then decorate any function you wanted to trace - such as external tool calls or resource fetches - to ensure comprehensive observability throughout your agent's workflow.
However, with Weave’s pre-built MCP integration, this process is now simplified. You only need to import and initialize Weave at the start of your MCP server and MCP client. Once enabled, Weave will automatically trace all MCP calls and interactions, seamlessly capturing execution flow, input/output parameters, and performance data across every agentic workflow. This means that every tool invocation, resource access, or prompt used by your MCP-compliant agents will be visible in Weave, providing deep insight without cluttering your code with additional decorators or tracing logic.
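On the server side, that boils down to something like the following sketch (the Weave project name and server name are placeholders; the same import-and-init lines apply to an MCP client):

import weave
from mcp.server.fastmcp import FastMCP

weave.init("my-mcp-project")  # placeholder Weave project name

# No @weave.op decorators required: with the MCP integration enabled,
# tool invocations, resource reads, and prompt usage are traced automatically.
mcp = FastMCP("my_server")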
This streamlined approach not only eliminates manual effort but also ensures consistent, granular observability across your entire agentic system. As you iterate, refine, and scale your MCP-powered agents, Weave makes it effortless to monitor real-world behavior, debug issues, and optimize for performance—all while maintaining a clear historical record of system changes. By comparing traces from previous and current versions, you can track progress, validate improvements, and confidently evolve your agentic applications over time. In short, Weave now brings frictionless visibility and actionable insight to every MCP agent, raising the bar for observability and governance in modern AI systems.

Demo: Combining MCP and A2A with Weave

Now we will demonstrate how to combine the power of both A2A and MCP while utilizing the Weave MCP integration. This demo won't go into the exact specifics of how A2A and MCP work, so if you are new to these protocols, feel free to check out my other tutorials showcasing how to use MCP and A2A. For this tutorial, I assume you have launched one of the demo currency conversion agents found in the A2A repo here. You can launch the agent with the following command (inside the A2A/samples/python/agents directory of the A2A repo):
uv run . --host 0.0.0.0 --port 1001
With your A2A agent running on port 1001, the next step is to build an MCP server using the FastMCP library. This MCP server will act as a bridge, exposing a tool to MCP clients (like Claude Desktop or other LLM apps) that, under the hood, discovers available A2A agents, determines the best one for each task, and relays requests and responses accordingly. When you initialize Weave in this server, all MCP tool invocations are automatically traced, providing visibility into how the agent is working!
Here's a simple script showing how to integrate A2A and MCP while also utilizing Weave! Make sure to replace your API key in this line of the script: os.environ["GEMINI_API_KEY"] = "your_gemini_api_key"
import asyncio
import os
import uuid
import logging
import base64
from typing import Optional, List, Dict

import httpx
from litellm import completion

from mcp.server.fastmcp import FastMCP
import weave

weave.init("wv_mcp")

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

AGENT_CARD_PATH = "/.well-known/agent.json"
HOST = "localhost"
START_PORT = 1000
END_PORT = 1010
MODEL_NAME = "gemini/gemini-2.0-flash"

# --- export/set API key for downstream libraries (if not already set elsewhere) ---
os.environ["GEMINI_API_KEY"] = "your_gemini_api_key"


def create_routing_prompt(history: List[Dict], agents: List[Dict]) -> List[Dict]:
    """Builds an LLM prompt asking the model to pick the best agent for the conversation."""
    agent_list = "\n".join(
        f"- {a['card'].get('name', 'Unknown')}: {a['card'].get('description', '')}" for a in agents
    )
    convo = []
    for msg in history:
        role = "User" if msg["role"] == "user" else "Agent"
        convo.append(f"{role}: {msg['text']}")
    convo_str = "\n".join(convo)
    content = (
        "You are an expert router for user requests. "
        "Given the following agents:\n"
        f"{agent_list}\n\n"
        "Below is the full conversation so far. "
        "Pay closest attention to the user's most recent request at the end. "
        "Choose the single best matching agent by exact name to handle the user's request, IN CONTEXT of the conversation.\n\n"
        f"Conversation:\n{convo_str}\n\nAgent name:"
    )
    return [{"role": "user", "content": content}]


def make_agent_instruction(history: List[Dict], agent_card: Dict) -> str:
    """Turns the conversation history plus the chosen agent's card into a single instruction."""
    convo = "\n".join(
        f"User: {msg['text']}" if msg["role"] == "user" else f"Agent: {msg['text']}"
        for msg in history
    )
    agent_desc = agent_card.get('description', 'No agent description provided.')
    agent_name = agent_card.get('name', '[Agent name missing]')
    return (
        f"You are the agent: {agent_name}.\n"
        f"Your job: {agent_desc}\n\n"
        "Conversation so far:\n"
        f"{convo}\n\n"
        "Based on this conversation, take the user's intended action or answer their last request, using all necessary info from the dialog.\n"
        "Return only the task/instruction that the agent will accept."
    )


async def fetch_agent_card(host: str, port: int) -> Optional[Dict]:
    """Fetches the A2A Agent Card from a host/port, or returns None if no agent responds."""
    url = f"http://{host}:{port}{AGENT_CARD_PATH}"
    try:
        async with httpx.AsyncClient(timeout=3) as client:
            resp = await client.get(url)
            resp.raise_for_status()
            card = resp.json()
            logger.info(f"Agent found: {card.get('name', '[no-name]')} on {host}:{port}")
            return {"host": host, "port": port, "card": card}
    except Exception as e:
        logger.debug(f"No agent at {host}:{port} ({e})")
        return None


async def discover_agents(host: str, start_port: int, end_port: int) -> List[Dict]:
    """Scans a port range for A2A agents and returns the Agent Cards that were found."""
    tasks = [fetch_agent_card(host, port) for port in range(start_port, end_port + 1)]
    results = await asyncio.gather(*tasks)
    return [r for r in results if r]


async def send_instruction_to_agent(agent: Dict, instruction: str) -> Optional[Dict]:
    """Sends a tasks/send JSON-RPC request to the selected A2A agent."""
    url = f"http://{agent['host']}:{agent['port']}/"
    jsonrpc_id = str(uuid.uuid4())
    payload = {
        "jsonrpc": "2.0",
        "id": jsonrpc_id,
        "method": "tasks/send",
        "params": {
            "id": jsonrpc_id,
            "message": {
                "role": "user",
                "parts": [
                    {"type": "text", "text": instruction}
                ]
            },
            "acceptedOutputModes": ["text/plain"],
        }
    }
    try:
        async with httpx.AsyncClient(timeout=30) as client:
            resp = await client.post(url, json=payload)
            resp.raise_for_status()
            data = resp.json()
            return data
    except Exception as e:
        logger.error(f"Error sending query to agent: {e}")
        return None


def extract_agent_text(agent_response: Dict) -> str:
    """Extracts text from agent response artifacts."""
    result = agent_response.get("result", {})
    artifacts = result.get("artifacts", [])
    texts = []
    for artifact in artifacts:
        for part in artifact.get("parts", []):
            if part.get("type") == "text":
                texts.append(part.get("text", ""))
    return "\n".join(texts)


# ------ MCP TOOL WRAPPER ------

def extract_agent_content(agent_response: Dict) -> Dict:
    """
    Extracts texts and images from agent response artifacts.

    Returns a dict:
        {
            "texts": [ ... ],
            "images": [ { "filename": ..., "bytes": ..., "type": ... }, ... ]
        }
    Images are decoded but also stored as bytes (and filenames if saved locally).
    """
    result = agent_response.get("result", {})
    artifacts = result.get("artifacts", [])
    texts = []
    images = []

    for artifact in artifacts:
        for part in artifact.get("parts", []):
            ptype = part.get("type")
            if ptype == "text":
                texts.append(part.get("text", ""))
            elif ptype in ("image/png", "image/jpeg", "image/jpg"):
                b64data = part.get("text") or part.get("bytes") or ""
                if not b64data:
                    logger.warning("Image artifact part missing data")
                    continue
                try:
                    image_bytes = base64.b64decode(b64data)
                    ext = "png" if "png" in ptype else "jpg"
                    filename = f"agent_image_{uuid.uuid4().hex}.{ext}"
                    # Optional: save to file
                    with open(filename, "wb") as f:
                        f.write(image_bytes)
                    images.append({"filename": filename, "bytes": image_bytes, "type": ptype})
                    texts.append(filename)
                except Exception as e:
                    logger.error(f"Failed to decode/save image: {e}")
            elif ptype == "file":
                file_info = part.get("file", {})
                b64data = file_info.get("bytes")
                mime = file_info.get("mimeType", "")
                if not b64data:
                    logger.warning("'file' artifact missing 'bytes' data")
                    continue
                try:
                    file_bytes = base64.b64decode(b64data)
                    ext = "bin"
                    if "png" in mime:
                        ext = "png"
                    elif "jpeg" in mime or "jpg" in mime:
                        ext = "jpg"
                    filename = f"agent_file_{uuid.uuid4().hex}.{ext}"
                    with open(filename, "wb") as f:
                        f.write(file_bytes)
                    images.append({"filename": filename, "bytes": file_bytes, "type": mime})
                    texts.append(filename)
                except Exception as e:
                    logger.error(f"Failed to decode/save file artifact: {e}")
            # (You can add more media/part types as needed)

    return {"texts": texts, "images": images}


# `history`: [{"role": "user"|"agent", "text": "..."}, ...]
# Returns: dict with "agent": ..., "response": ..., and updated "history"
async def run_a2a_router(query: str, history: Optional[List[Dict]] = None) -> Dict:
    """
    Runs the A2A multi-agent router: discovers agents, routes, relays, and returns the agent response.
    """
    # 1. Discover agents
    agents = await discover_agents(HOST, START_PORT, END_PORT)
    if not agents:
        return {"error": "No agents discovered."}

    if history is None:
        history = []

    history = history.copy()
    history.append({"role": "user", "text": query})

    # 2. Route to best agent via LLM
    prompt_messages = create_routing_prompt(history, agents)
    response = completion(
        model=MODEL_NAME,
        messages=prompt_messages,
        max_tokens=16,
        temperature=0.0,
    )
    agent_name = response.get("choices", [{}])[0].get("message", {}).get("content", "").strip()
    matched_agent = None
    for agent in agents:
        if agent["card"].get("name", "").lower() == agent_name.lower():
            matched_agent = agent
            break
    if not matched_agent:
        matched_agent = agents[0]

    # 3. Build agent instruction and relay
    agent_instruction = make_agent_instruction(history, matched_agent["card"])
    agent_response = await send_instruction_to_agent(matched_agent, agent_instruction)
    if not agent_response:
        return {"error": f"Agent '{matched_agent['card'].get('name')}' did not return a valid response."}

    reply_text = extract_agent_text(agent_response)
    if reply_text:
        history.append({"role": "agent", "text": reply_text})

    return {
        "agent": matched_agent["card"].get("name"),
        "response": reply_text,
        "history": history,
        "raw_response": agent_response,
    }


mcp = FastMCP("mcp_server")


@mcp.tool()
async def handle_route_to_agents_with_a2a(query: str, history: Optional[List[Dict]] = None) -> Dict:
    """Use this tool for any queries requiring up-to-date information."""
    return await run_a2a_router(query, history)


if __name__ == "__main__":
    mcp.run(transport='stdio')

With this setup, whenever an MCP client (such as Claude Desktop) invokes the handle_route_to_agents_with_a2a tool, your server will automatically discover the A2A agents listening on the configured port range, craft a routing prompt so the LLM can select the best agent, and relay the request - including any conversation history. The chosen agent executes the task, and the response is piped back through the MCP protocol.
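If you want to exercise the tool outside of Claude Desktop, a rough sketch of a programmatic MCP client looks something like this (the script filename and query are assumptions; see the MCP Python SDK docs for the authoritative client API):

import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main():
    # Assumes the router script above was saved as a2a_mcp_server.py
    server = StdioServerParameters(command="python", args=["a2a_mcp_server.py"])
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            result = await session.call_tool(
                "handle_route_to_agents_with_a2a",
                arguments={"query": "How many Japanese yen is 100 USD?"},
            )
            print(result)

asyncio.run(main())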
Thanks to the Weave integration, every request to the handle_route_to_agents_with_a2a tool is traced and visualized in the Weave platform. You can explore full execution traces, monitor agent collaboration performance, and even compare different iterations of your multi-agent infrastructure over time.
This fusion of A2A and MCP protocols, stitched together with Weave’s observability, offers a seamless path toward building and scaling agentic systems that are both highly interoperable and profoundly transparent. Whether you're investigating workflow bottlenecks, analyzing agent routing decisions, or tracking improvements across deployments, this architecture gives you the insight and control to ship robust agent applications with confidence.
Now we can try out our system inside Claude Desktop.


After our agent completes our query, we can navigate to Weave and see the trace for our interaction with the agent. Since Weave is now directly integrated with MCP, we only need to import and initialize Weave, and MCP tool calls will automatically be logged inside Weave.


The power of Weave

For anyone developing MCP servers, I highly recommend Weave: its detailed tracing helps surface hidden performance problems, refine agent selection logic, and build a much clearer picture of multi-agent system behavior under real-world conditions.
When you're building multi-agent systems, the real work isn’t just getting something to respond. It’s about iterating over the design - tweaking how agents interact, changing selection logic, improving prompt templates, adjusting fallback strategies, optimizing network communication - and watching how those changes actually affect behavior across many real conversations.
Without deep tracing, that iteration process is slow and guesswork-driven. You tweak something, run a few tests manually, and hope it improved. With Weave, every single tool invocation across all agents is captured with full context: which agents were discovered, why one was selected, what prompt was used to route the task, how long each response took, and what final outputs were generated.
This lets you move fast: you can trace a failed or slow workflow, drill into exactly where the issue came from (bad selection, agent error, slow response), fix it, redeploy, and immediately verify if the new behavior is better. You don’t have to build custom logging for each change or manually monitor agents anymore - the whole system is observable out of the box.
When scaling up agents across tasks or teams, this kind of feedback loop is the difference between shipping something functional and shipping something reliable and scalable. Weave turns iteration into a real, visible engineering process instead of intuition and hope.

W&B MCP Server demo

Now that we have shown off the Weave integration with MCP, I want to show a really neat new way to understand projects within W&B. Weights & Biases now has an MCP server that supports querying information about projects. To get started, you will need to download the server from this repo and add the JSON config to your claude_desktop_config.json file. After the server is set up, you can launch an LLM client like Claude Desktop and ask questions about specific projects.
For the demo, I'll first do a few test runs using W&B Models. I will run a simple test of the few-shot generalization capabilities of a few image classification models on the CIFAR-10 dataset. Here's the code for my run:
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
import numpy as np
import wandb
from collections import defaultdict


def get_fewshot_loader(dataset, num_per_class, batch_size=32):
    """Returns a DataLoader with num_per_class images per class."""
    targets = np.array(dataset.targets)
    idxs = []
    class_counts = defaultdict(int)
    for idx, label in enumerate(targets):
        if class_counts[label] < num_per_class:
            idxs.append(idx)
            class_counts[label] += 1
        if all(class_counts[k] >= num_per_class for k in range(10)):
            break
    subset = torch.utils.data.Subset(dataset, idxs)
    return torch.utils.data.DataLoader(subset, batch_size=batch_size, shuffle=True, num_workers=2)


def evaluate(model, dataloader, device):
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for x, y in dataloader:
            x, y = x.to(device), y.to(device)
            logits = model(x)
            pred = logits.argmax(1)
            correct += (pred == y).sum().item()
            total += len(y)
    return correct / total


def main():
    wandb.init(project="cifar10-fewshot-eval")

    # Use GPU if available
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

    # Pretrained models and their final layer info
    models_info = {
        'resnet18': {
            'model': torchvision.models.resnet18(weights='DEFAULT'),
            'final_layer': ('fc', 512)
        },
        'densenet121': {
            'model': torchvision.models.densenet121(weights='DEFAULT'),
            'final_layer': ('classifier', 1024)
        },
        'vgg16': {
            'model': torchvision.models.vgg16(weights='DEFAULT'),
            'final_layer': ('classifier', 4096, 6)  # index 6 in classifier
        },
        'efficientnet_b0': {
            'model': torchvision.models.efficientnet_b0(weights='DEFAULT'),
            'final_layer': ('classifier', 1280)
        }
    }

    transform = transforms.Compose([
        transforms.Resize(224),
        transforms.ToTensor(),
        transforms.Normalize(
            mean=[0.485, 0.456, 0.406],
            std=[0.229, 0.224, 0.225]
        )
    ])

    trainset = torchvision.datasets.CIFAR10(
        root='./data', train=True, download=True, transform=transform)
    testset = torchvision.datasets.CIFAR10(
        root='./data', train=False, download=True, transform=transform)
    testloader = torch.utils.data.DataLoader(
        testset, batch_size=128, shuffle=False, num_workers=2)

    fewshot_loader = get_fewshot_loader(trainset, num_per_class=10, batch_size=16)

    for name, info in models_info.items():
        print(f"\nTraining and evaluating {name} ...")
        model = info['model']
        # Replace final layer for CIFAR-10
        if name == 'vgg16':
            clf = model.classifier
            clf[6] = nn.Linear(info['final_layer'][1], 10)
            model.classifier = clf
        else:
            setattr(model, info['final_layer'][0], nn.Linear(info['final_layer'][1], 10))
        model.to(device)

        # Only train the head
        for param in model.parameters():
            param.requires_grad = False
        if name == 'vgg16':
            for param in model.classifier[6].parameters():
                param.requires_grad = True
        else:
            for param in getattr(model, info['final_layer'][0]).parameters():
                param.requires_grad = True

        optimizer = optim.Adam(filter(lambda p: p.requires_grad, model.parameters()), lr=0.01)
        criterion = nn.CrossEntropyLoss()

        # Quick training (2 epochs)
        model.train()
        for epoch in range(2):
            total_loss = 0
            for x, y in fewshot_loader:
                x, y = x.to(device), y.to(device)
                optimizer.zero_grad()
                logits = model(x)
                loss = criterion(logits, y)
                loss.backward()
                optimizer.step()
                total_loss += loss.item() * len(y)
            avg_loss = total_loss / len(fewshot_loader.dataset)
            print(f" Epoch {epoch+1}: avg loss {avg_loss:.4f}")

        # Eval
        acc = evaluate(model, testloader, device)
        print(f"{name}: Test set accuracy after few-shot training: {acc:.4f}")
        wandb.log({f"{name}_fewshot_acc": acc})

    wandb.finish()


if __name__ == "__main__":
    main()

I'll be the first to admit that this isn't the most scientifically rigorous experiment, as hyperparameters can play a big role in few-shot generalization. After running the experiment, we can navigate to Claude Desktop and simply ask about the results for the project.
Next, Claude will use the tools in the W&B MCP Server to find out more information about the project.



What's happening under the hood is that Claude is making an MCP tool call, and the server is executing a query against the Weights & Biases API to retrieve the project metadata. This query result is then parsed and summarized automatically inside the Claude chat, without any manual digging through the UI or dashboards.
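The kind of query it runs is similar to what you could do yourself with the W&B public API - roughly along these lines (the entity name below is a placeholder):

import wandb

api = wandb.Api()

# Placeholder entity; the MCP server resolves the entity/project from its config.
runs = api.runs("my-entity/cifar10-fewshot-eval")

for run in runs:
    # run.summary holds the final logged metrics, e.g. resnet18_fewshot_acc
    accs = {k: v for k, v in run.summary.items() if k.endswith("_fewshot_acc")}
    print(run.name, accs)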
Next, I can even automate the process of creating a report to document the results!

Claude generated a W&B Report summarizing the model performance. The report includes a breakdown of the few-shot accuracy scores for each model tested, a bar chart visualizing the results, and clear next step recommendations based on the findings.


This type of tool definitely feels like the future of AI research. Hopefully, it will soon be possible to simply instruct an AI agent to come up with several unique (and high quality) research ideas, create experiments to test those ideas, and ultimately make new AI breakthroughs! It should be interesting to see how these sorts of agents will evolve!

Conclusion

As AI systems evolve toward more complex, multi-agent architectures, the need for visibility, interoperability, and rapid iteration becomes greater than ever. Standards like MCP and A2A offer the foundation for connecting agents and tools across platforms, but building truly scalable, understandable systems still requires powerful observability.
Weave bridges that gap. With first-class support for tracing MCP workflows, agent routing, tool invocation, and now Weights & Biases project queries, Weave gives developers the control they need to ship robust multi-agent systems faster. Instead of building your own telemetry stack, you can focus on improving agent behavior, refining routing logic, and accelerating experiments - all while keeping a complete, searchable history of your system’s evolution.
Native A2A support in Weave is also coming soon. In the meantime, if you want to integrate Weave into your A2A systems today, it's as simple as importing and initializing Weave at the start of your application and using @weave.op to wrap any function you want traced, as sketched below. Whether you're building a simple LLM app or a full multi-agent workflow, Weave makes it easy to capture traces, monitor real-world behavior, and iterate faster. If you're working with LLM-driven agents, autonomous workflows, or building on MCP and A2A, Weave is the observability platform you'll want in your stack.
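For example, in an A2A server you might wrap the handlers you care about along these lines (the function name and project name are hypothetical):

import weave

weave.init("my-a2a-project")  # placeholder Weave project name

@weave.op()
async def handle_task(task_text: str) -> str:
    """A hypothetical A2A task handler; each call shows up as a trace in Weave."""
    # ... call your model or tools here ...
    return f"Handled: {task_text}"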

Iterate on AI agents and models faster. Try Weights & Biases today.