Monitoring trustworthy agents with Vijil and Weave

A guide to enabling observability and security for your agents with just a few lines of code
Created on October 30|Last edited on October 30
Comment
﻿
If you've built an AI agent but haven't deployed it into production yet because of security and safety concerns, you aren't alone.  With a growing number of attacks and vulnerabilities being discovered every day, many enterprises take on real risks to reputation and revenue when they deploy an AI agent into the real world. 
At Vijil, we're helping enterprises build and operate trustworthy agents. We provide tools to test and improve the reliability, security, and safety of LLM-based applications. 
In this article, learn how you can defend your AI agents using Vijil Dome monitoring that defense with Weights & Biases Weave.  
Table of ContentsTable of ContentsIntroductionBuilding trustworthy agents with Vijil DomeAdding Observability with W&B WeaveConclusion
﻿
IntroductionWhile agents and other LLM-based applications have surged in popularity, enterprises remain cautious about deploying them into production. Even though language models are fluent and capable, LLMs are inherently probabilistic, incapable of logical reasoning, and vulnerable to attack. This can make them hard to trust "out-of-the-box." In other words: you'll want to test your agents before you can trust your agents.  
For example, the code snippet below instructs a simple "writing assistant" to help a user with writing and editing tasks.  
from openai import OpenAI
﻿
client = OpenAI()
system_prompt = """
You are WriterBot - a writing assistant AI. 
Help suggest titles, proofread and rephrase my writing, and give me unique and interesting story suggestions when asked. 
"""
# LLM Query
def ask_llm(model: str, query: str) -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role" : "system", "content" : system_prompt},
            {"role": "user", "content": query}
        ],
    )
    return response.choices[0].message.content
﻿
For benign queries, the LLM responds as expected:
﻿
But the LLM is capable of so much more—and not all of those capabilities are benign. 
While on the job, the "writing assistant" can help you create an explosive by suggesting the use of Ammonium Nitrate extracted from fertilizers. 
﻿
This example is just for illustration purposes but you can easily imagine prompting LLMs to generate extremely malicious and harmful responses. 
If your organization enforces a policy that its AI agent should not provide information to a user that would encourage or aid them in harming themselves or others, you need guardrails around the LLM inside the agent. You would want to catch an out-of-policy response when the LLM generates it, or better yet, flag the query as malicious and never send it to the LLM in the first place. Also, you will want to monitor the LLM to make sure that the guardrails are in fact working as expected. ﻿﻿
Building trustworthy agents with Vijil DomeVijil Dome is an agent guardrails library designed to improve the trustworthiness of AI agents and other LLM-based applications. It detects and blocks jailbreaks, prompt injections, PII leaks, loss of privacy and confidentiality, and the generation of toxic, stereotypical, biased, and unethical responses.  Vijil Dome is more comprehensive than most guardrails, more accurate in its detection, and faster in its enforcement. It's also uniquely flexible, enabling you to customize the guardrails to adapt them exquisitely to your agent use case.
To measure the trustworthiness of an agent, we use the Vijil Evaluate engine which uses over 200,000 diverse prompts to score the reliability, security, and safety of an agent. Using  Vijil Trust Score, we can see the extent to which Vijil Dome improves the trustworthiness of an agent. For example, the Vijil Trust Score for Llama 3.1 70B  improves by almost 30% with Vijil Dome.    



































Model (Hub) x Trust ScoreWithout Vijil DomeWith Vijil Dome
GPT 4o (OpenAI)80.4084.82
Anthropic Claude 3.5 (AWS Bedrock)75.8480.12
Gemini 1.5 Flash (Vertex AI on GCP)73.8277.45
Meta LLama 3.1 70B (Together AI)60.6678.11
Mistral-Nemo (Vertex AI on GCP)61.5371.41
﻿
﻿
Integrating with Vijil Dome into an agent is easy. Simply pass the input and output to the ask_llm function through the guard_input and guard_output methods: 
from vijil_dome import Dome, get_default_config
from openai import OpenAI
﻿
# Dome is compatible with both Sync and Async clients
client = OpenAI()
﻿
# LLM Query
def ask_llm(model: str, query: str) -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": query}],
    )
    return response.choices[0].message.content
﻿
# Create your dome instance
dome = Dome(get_default_config())
﻿
# Guarded completion
def ask_guarded_client(model: str, query: str):
    # Scan the input
    input_scan_result = dome.guard_input(query)
    if input_scan_result.is_safe():
        # If the input is safe, use the sanitized prompt
        client_output = ask_llm(model, input_scan_result.guarded_response())
        # pass the output through the output guardrail
        output_scan_result = dome.guard_output(client_output)
        return output_scan_result.guarded_response()
    else:
        return input_scan_result.guarded_response()
With Vijil Dome, the example prompt that we tried earlier no longer produces a harmful response.
﻿
 
Adding Observability with W&B WeaveVijil Dome wraps guards around the LLM calls in an application. To ensure observability over these guardrails, we need to trace each and every request that passes through Vijil Dome. You can now do this with one line of code that integrates the Vijil Dome guardrail calls with Weave. The apply_decorator function adds all Dome executions to the appropriate Weave Trace:
# Replace the default guard input and output functions with the weave versions
dome.apply_decorator(weave.op())
﻿
# Add the decorator to the ask_guarded_client function
@weave.op()
def ask_guarded_client(model: str, query: str):
	# The same code as earlier
With this, the trace includes the inputs and outputs from every guard used in the call. Additionally, it includes execution time and debugging metadata. From the Weave UI, we can see every guard's execution, input, output and processing time. We can even see the data associated with the guards that under the hood use an LLM-based detector. 
The example below shows how the input guards processed the query "What is the capital of Grenada?"  The query passed through two guards - one for security and another for moderation. The security guardrail uses two methods: an ML model and an LLM. Meanwhile, the moderation guardrail uses three methods: an ML model, an LLM, and a keyword blocklist. 
We can drill down into any of these individual detection methods to obtain debugging information, execution time, and more:
﻿
﻿
Vijil Dome is designed to evolve and improve as fast as the LLMs that it defends by learning from feedback. Weave makes it easy to provide feedback to Dome. To flag a guardrail detection for review or improvement, our partners can annotate a trace with a thumbs down (👎).
﻿
In fact, at Vijil, we query Weave for the feedback from our partners so we can keep improving Dome.
import weave
﻿
client = weave.init("dome-example")
﻿
# Find all feedback objects with a specific reaction. You can specify an offset and limit.
thumbs_up = client.feedback(reaction="👎")
﻿
# After retrieval you can view the details of individual feedback objects and traces
# For objects in Dome, this includes the input query, as well as the objects saved in the trace
for f in thumbs_up:
    ref = f.weave_ref
    call_id = ref.split("/")[-1]
    call = client.call(call_id)
    print(call)
    print("------------")
﻿
ConclusionIf you're planning to deploy an AI agent, we recommend that you defend it holistically and monitor that defense continuously.  You can use Vijil Dome to create a perimeter defense around your LLM-based agent, Vijil Evaluate to measure the difference, and W & B Weave to monitor that defense continuously. If you're building an LLM-based application and want to try out these security and observability mechanisms, shoot us a message at contact@vijil.ai
﻿
Model (Hub) x Trust Score	Without Vijil Dome	With Vijil Dome
GPT 4o (OpenAI)	80.40	84.82
Anthropic Claude 3.5 (AWS Bedrock)	75.84	80.12
Gemini 1.5 Flash (Vertex AI on GCP)	73.82	77.45
Meta LLama 3.1 70B (Together AI)	60.66	78.11
Mistral-Nemo (Vertex AI on GCP)	61.53	71.41
Add a comment
Vin Sharma • 12 months ago
, a real-time perimeter defense solution, delete this in the interest of flow
Tags: Articles, LLM, GenAI
Iterate on AI agents and models faster. Try Weights & Biases today.