US AISI’s findings on agent hijacking evaluations

Created on March 17|Last edited on March 17
Comment
As AI-powered agents become more advanced, their security risks also grow. One major concern is agent hijacking, where attackers manipulate an AI system into performing unintended and potentially harmful actions. To better understand and mitigate this risk, the U.S. AI Safety Institute (US AISI) conducted experiments using AgentDojo, an open-source framework designed to test AI vulnerabilities. Their research led to four major insights on improving AI agent security.  
Understanding AI Agent Hijacking  Agent hijacking is a security vulnerability where an attacker embeds malicious instructions within seemingly normal data—such as an email, file, or website—that an AI agent interacts with. Because these agents merge trusted instructions with external data, they may unknowingly follow the attacker’s commands. This can result in dangerous actions like data leaks, unauthorized transactions, or even executing malicious code.  
Evaluating AI Security with AgentDojo  To test AI agent vulnerabilities, US AISI used AgentDojo, a leading open-source security evaluation tool developed at ETH Zurich. Their tests focused on Claude 3.5 Sonnet, a model by Anthropic that has shown resilience against hijacking attacks. AgentDojo simulates four environments—Workspace, Travel, Slack, and Banking—where AI agents complete various tasks. In each scenario, the AI is given a legitimate task but also encounters hidden malicious instructions. If the agent follows these harmful instructions, it is considered successfully hijacked.  
Key Insights from US AISI’s Research  Continuous improvement and expansion of evaluation frameworks are essential. AI threats evolve rapidly, meaning security evaluation tools must be regularly updated. US AISI improved AgentDojo by fixing bugs, adding asynchronous execution support, and integrating with Inspect. They also introduced new security tests focused on high-risk vulnerabilities, including remote code execution, database exfiltration, and automated phishing. These additions helped assess risks not previously covered in the framework.  
Evaluations must adapt as AI models improve. While newer AI systems may resist known attack methods, attackers continuously develop new strategies. US AISI found that Claude 3.5 Sonnet was significantly more resistant to older hijacking techniques, but when tested against newly designed attacks, its vulnerability increased. To account for this, US AISI collaborated with the UK AI Safety Institute on a red teaming exercise, which increased the measured attack success rate from 11% to 81%. This demonstrates that AI security evaluations must evolve alongside the models they test.  
﻿
Analyzing attack success rates by specific tasks provides deeper risk insights. Measuring overall attack success is useful, but breaking it down by task type offers a clearer picture of potential dangers. Some attacks—such as executing a malicious script—were far more successful and had severe consequences compared to others, like sending a harmless email. Understanding which attack types are most likely to succeed allows for more targeted security improvements.  
﻿
Testing attacks multiple times gives a more accurate picture of risk. Many security evaluations only test whether an AI system can be hijacked on the first attempt. However, because AI models are probabilistic, they may behave differently each time. US AISI found that when they repeated attacks multiple times, the average success rate increased from 57% to 80%. This suggests that real-world attackers, who can try multiple times, have a much higher chance of success than single-attempt evaluations suggest.  
﻿
Looking Ahead  AI agent hijacking is an ongoing challenge that will require continuous research and adaptation. Strengthening evaluation frameworks, testing new attack strategies, and improving security measures will be essential as AI systems become more integrated into everyday applications. Future research should focus on developing and validating more effective defenses to ensure AI agents can operate safely and securely.
﻿
Add a comment
Tags: ML News
Iterate on AI agents and models faster. Try Weights & Biases today.