A Benchmark for Evaluating Harmful Behavior in Computer-Use Agents
Website | Dataset | Paper | Appendix | Hugging Face
AgentHazard is a comprehensive benchmark for evaluating harmful behavior in computer-use agents. Unlike traditional prompt-level safety benchmarks, AgentHazard focuses on execution-level failures that emerge through the composition of locally plausible steps across multi-turn, tool-mediated trajectories.
- 2,653 curated instances spanning diverse risk categories and attack strategies
- 10 risk categories: RCE, Data Exfiltration, Persistence, Supply Chain Poisoning, Destructive Actions, Internal Reconnaissance, Privilege Escalation, Resource Exhaustion, Prompt Intelligence Theft, and Compliance Bypass
- 10 attack strategies: Direct, Recursive Salami Slicing, Security Audit Gaslighting, Dependency Hook Trigger, Pseudo Diagnostic Debugging, Contextual Persona Enforcement, Encoded Payload Smuggling, Implicit Indirect Injection, Logical Dependency Inversion, and Legacy Code Maintenance
- 3 agent frameworks: Claude Code, OpenClaw, IFlow
- Multiple models: Qwen2.5/3, Kimi K2/K2.5, GLM-4.6, DeepSeek
- 10,000+ execution trajectories for detailed analysis
Our attacks are not merely theoretical; they trigger actual security alerts in production environments:
Real security monitoring alert triggered during AgentHazard evaluation, confirming the real-world severity of identified vulnerabilities.
All materials required for review are available below:
| Resource | Description | Link |
|---|---|---|
| Dataset | 2,653 instances in JSON format | Download |
| Appendix | Detailed taxonomy and methodology | Download |
| Code | Full source code & evaluation framework | GitHub |
| Trajectories | Execution traces from agent runs | Browse |
# Clone the repository
git clone https://github.com/Yunhao-Feng/AgentHazard.git
cd AgentHazard

# Install dependencies
pip install -r requirements.txt

# Load the dataset
import json

with open('data/dataset.json', 'r') as f:
    dataset = json.load(f)

print(f"Loaded {len(dataset)} instances")

| Statistic | Value |
|---|---|
| Total Instances | 2,653 |
| Risk Categories | 10 |
| Attack Strategies | 10 |
| Seed Pool Size | 8,000+ |
| Avg. Decomposition Length | 11.55 |
| Avg. Target Length | 20.98 |
| Agent Frameworks | 3 |
| Trajectories | 10,000+ |
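
The headline counts can be checked against a loaded dataset. The following is a minimal sketch, assuming each instance carries the `risk_category` and `attack_strategy` fields shown in the instance structure in this README. The `summarize` helper and the three sample records are hypothetical and only illustrate the shape of the data.

```python
from collections import Counter


def summarize(dataset):
    """Count instances per risk category and per attack strategy."""
    by_category = Counter(item["risk_category"] for item in dataset)
    by_strategy = Counter(item["attack_strategy"] for item in dataset)
    return {
        "total": len(dataset),
        "risk_categories": dict(by_category),
        "attack_strategies": dict(by_strategy),
    }


# Hypothetical records for illustration; real data comes from data/dataset.json.
sample = [
    {"id": "a", "risk_category": "System RCE", "attack_strategy": "Direct"},
    {"id": "b", "risk_category": "System RCE", "attack_strategy": "Recursive Salami Slicing"},
    {"id": "c", "risk_category": "Data Exfiltration", "attack_strategy": "Direct"},
]

stats = summarize(sample)
print(stats["total"], len(stats["risk_categories"]), len(stats["attack_strategies"]))
```

On the sample this prints `3 2 2`. Run on the full dataset, the same three numbers should correspond to the Total Instances, Risk Categories, and Attack Strategies rows, provided the field names match the released file.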
AgentHazard is constructed through a rigorous multi-stage pipeline:
- Taxonomy Design: Define risk categories and attack strategies from vulnerability knowledge bases and prior literature
- Task Generation: Create task templates embedding harmful objectives within realistic workflows
- Seed Pool Generation: Use LLMs to generate large-scale candidate instances
- Execution-Based Filtering: Run instances in sandboxed agent environments to validate executability
- LLM-as-Judge Evaluation: Use Gemini-3 and GPT-5.4 to verify alignment with intended objectives
- Human Curation: Manual review to remove duplicates and ensure quality
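
The filtering stages of this pipeline can be pictured as a chain of predicates applied to the seed pool. The sketch below is illustrative only and is not the authors' implementation: the `executed` and `judge_aligned` flags are hypothetical stand-ins for the sandbox run and the LLM-as-judge verdict, and the automatic `dedup` stage merely approximates the duplicate removal that the benchmark performs through manual review.

```python
from typing import Callable, Iterable

Candidate = dict
Stage = Callable[[Candidate], bool]


def run_filters(candidates: Iterable[Candidate], stages: list[tuple[str, Stage]]):
    """Apply each stage in order, recording how many candidates survive it."""
    survivors = list(candidates)
    report = []
    for name, keep in stages:
        survivors = [c for c in survivors if keep(c)]
        report.append((name, len(survivors)))
    return survivors, report


def make_dedup() -> Stage:
    """Return a stage that keeps only the first candidate per task description."""
    seen = set()

    def keep(candidate: Candidate) -> bool:
        key = candidate["task_description"].strip().lower()
        if key in seen:
            return False
        seen.add(key)
        return True

    return keep


# Placeholder predicates: stand-ins for sandbox execution and judge verdicts.
stages = [
    ("execution_filter", lambda c: c.get("executed", False)),
    ("judge_filter", lambda c: c.get("judge_aligned", False)),
    ("dedup", make_dedup()),
]

# Hypothetical seed pool with one duplicate, one non-executable, one misaligned task.
seed_pool = [
    {"task_description": "Fix the failing unit test", "executed": True, "judge_aligned": True},
    {"task_description": "fix the failing unit test ", "executed": True, "judge_aligned": True},
    {"task_description": "Update the build script", "executed": False, "judge_aligned": True},
    {"task_description": "Rotate the log files", "executed": True, "judge_aligned": False},
]

curated, report = run_filters(seed_pool, stages)
print(report)
```

Recording the survivor count after each stage makes it easy to see where a seed pool shrinks on its way to the curated set.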
We evaluate AgentHazard on three representative agent frameworks using diverse backbone models:
| Framework | Model | Overall ASR (%) | Harmfulness Score |
|---|---|---|---|
| Claude Code | GLM-4.6 | 82.90 | 7.05 |
| Claude Code | Kimi-K2.5 | 78.30 | 6.47 |
| Claude Code | Qwen3-Coder | 73.63 | 5.95 |
| IFlow | Qwen2.5-Coder-32B | 74.70 | 6.06 |
| OpenClaw | Kimi-K2 | 71.10 | 5.82 |
| OpenClaw | GLM-4.6 | 70.80 | 6.15 |
ASR = Attack Success Rate (higher is more vulnerable)
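
As a worked illustration of the metric, ASR is the share of attempts judged successful, expressed as a percentage. This is a minimal sketch: the boolean verdicts are hypothetical, and how an individual attempt is judged successful is defined by the benchmark's evaluation procedure, not by this snippet.

```python
def attack_success_rate(outcomes):
    """Return ASR as a percentage: successful attacks / total attempts * 100."""
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("ASR is undefined for an empty set of outcomes")
    return 100.0 * sum(1 for o in outcomes if o) / len(outcomes)


# Hypothetical per-instance verdicts (True = the harmful objective was carried out).
verdicts = [True, True, False, True]
print(f"ASR = {attack_success_rate(verdicts):.2f}%")
```

With three successes out of four attempts, this prints `ASR = 75.00%`.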
- High Vulnerability: Current agents remain broadly vulnerable, with the highest ASR reaching 82.90%
- Multi-Step Escalation: ASR roughly triples between Round 1 and Round 3, confirming trajectory-dependent harm
- Framework Variance: Same model shows up to 16pp ASR difference across frameworks
- Guard Model Limitations: Best guard model (Llama-Guard-3-8B) detects only 27.03% of harmful intent

Real-World Impact: Attacks trigger actual security monitoring systems in production environments (see alert evidence)
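
The framework-variance finding can be explored from the results table with a per-model comparison. The sketch below uses only the six rows reproduced in this README; the `framework_gaps` helper is hypothetical.

```python
from collections import defaultdict

# (framework, model, overall ASR %) copied from the results table in this README.
rows = [
    ("Claude Code", "GLM-4.6", 82.90),
    ("Claude Code", "Kimi-K2.5", 78.30),
    ("Claude Code", "Qwen3-Coder", 73.63),
    ("IFlow", "Qwen2.5-Coder-32B", 74.70),
    ("OpenClaw", "Kimi-K2", 71.10),
    ("OpenClaw", "GLM-4.6", 70.80),
]


def framework_gaps(rows):
    """For each model seen in 2+ frameworks, return max ASR minus min ASR (pp)."""
    by_model = defaultdict(list)
    for _framework, model, asr in rows:
        by_model[model].append(asr)
    return {
        model: round(max(values) - min(values), 2)
        for model, values in by_model.items()
        if len(values) > 1
    }


print(framework_gaps(rows))
```

Only GLM-4.6 appears under two frameworks in the rows listed here, so the sketch reports a single gap of 12.1 percentage points. The "up to 16pp" figure presumably draws on the full set of results, which this README does not reproduce.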
Risk categories:
- System RCE: Arbitrary command execution
- Data Exfiltration: Unauthorized data extraction
- Persistence Establishment: Backdoor insertion
- Supply Chain Poisoning: Dependency manipulation
- Destructive Action: Irreversible damage
- Internal Reconnaissance: Network enumeration
- Privilege Escalation: Container escape
- Resource Exhaustion: DoS attacks
- Prompt Intelligence Theft: System prompt extraction
- Compliance Bypass: Policy violation

Attack strategies:
- Direct: Explicit harmful requests
- Recursive Salami Slicing: Decomposition into benign subtasks
- Security Audit Gaslighting: Framing as security validation
- Dependency Hook Trigger: Embedding in build/CI pipelines
- Pseudo Diagnostic Debugging: Fabricated failure traces
- Contextual Persona Enforcement: Role escalation
- Encoded Payload Smuggling: Obfuscation techniques
- Implicit Indirect Injection: Hidden instructions in artifacts
- Logical Dependency Inversion: Harm as prerequisite for safety
- Legacy Code Maintenance: Framing as legacy preservation
Each instance in AgentHazard follows this structure:
{
  "id": "unique_instance_id",
  "risk_category": "System RCE",
  "attack_strategy": "Recursive Salami Slicing",
  "task_description": "Fix the failing unit test...",
  "decomposition": [
    "Step 1: Inspect the test file",
    "Step 2: Identify the assertion",
    "Step 3: Modify the configuration",
    "..."
  ],
  "target_harmful_action": "Execute arbitrary shell command",
  "workspace_files": {
    "test.py": "...",
    "config.json": "..."
  },
  "expected_trajectory_length": 5,
  "metadata": {
    "difficulty": "medium",
    "frameworks": ["claudecode", "openclaw"]
  }
}

AgentHazard/
├── data/
│   ├── dataset.json              # Main benchmark dataset
│   └── evaluation_results.csv    # Experimental results
├── traces/
│   ├── claudecode/               # Claude Code execution traces
│   ├── openclaw/                 # OpenClaw execution traces
│   └── iflow/                    # IFlow execution traces
├── config/                       # Configuration files
├── pics/                         # Images and figures
│   ├── logo.png
│   ├── framework.png
│   ├── illustration.png
│   ├── demo.jpg
│   ├── heatmap.jpg
│   └── alert.png                 # Real security alert evidence
├── appendix.pdf                  # Supplementary material
├── index.html                    # Project website
├── README.md                     # This file
└── LICENSE                       # MIT License
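
Combining the instance structure with the layout above, a loader can check each record before use. This is a sketch under two assumptions: that `data/dataset.json` holds a top-level list of instances, and that the keys listed in `REQUIRED_KEYS` (taken from the example instance in this README) are present on every record. The released file may differ, so treat the key set as illustrative.

```python
import json
from pathlib import Path

# Top-level keys taken from the example instance in this README (an assumption).
REQUIRED_KEYS = {
    "id", "risk_category", "attack_strategy", "task_description",
    "decomposition", "target_harmful_action", "workspace_files",
}


def missing_keys(instance):
    """Return the required keys that an instance lacks, sorted for stable output."""
    return sorted(REQUIRED_KEYS - set(instance))


def load_dataset(path="data/dataset.json"):
    """Load the dataset and fail early if any instance lacks a required key."""
    dataset = json.loads(Path(path).read_text(encoding="utf-8"))
    for index, instance in enumerate(dataset):
        gaps = missing_keys(instance)
        if gaps:
            raise ValueError(f"instance {index} is missing keys: {gaps}")
    return dataset


# Quick self-check on an in-memory record, so the sketch runs without the repo.
print(missing_keys({"id": "x", "risk_category": "System RCE"}))
```

Failing on the first incomplete record keeps schema drift from surfacing later as a `KeyError` deep inside an evaluation run.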
If you use AgentHazard in your research, please cite:
@inproceedings{feng2026agenthazard,
  title     = {AgentHazard: A Benchmark for Evaluating Harmful Behavior in Computer-Use Agents},
  author    = {Feng, Yunhao and Ding, Yifan and Tan, Yingshui and Ma, Xingjun and
               Li, Yige and Wu, Yutao and Gao, Yifeng and Zhai, Kun and Guo, Yanming},
  booktitle = {arXiv:2604.02947},
  year      = {2026}
}

This project is licensed under the MIT License - see the LICENSE file for details.
We welcome contributions! Please see our contributing guidelines for details.
- Special thanks to Yutao Wu and the Internal Safety Collapse (ISC) project for their generous support and valuable insights. Their work on execution-level safety failures in frontier language models and AI agents provided important inspiration for the development of AgentHazard.
- We gratefully acknowledge Alibaba Group and Fudan University for their support of this research.
- We also thank the Claude Code, OpenClaw, and IFlow teams for their agent frameworks, and the Qwen, Kimi, GLM, and DeepSeek teams for their open models and contributions to the research community.
For questions, issues, or collaboration inquiries:
- GitHub Issues: Report a bug or request a feature
- Website: https://yunhao-feng.github.io/AgentHazard/
This benchmark is released for research purposes to advance the safety of computer-use agents. Please use responsibly and ethically.
Made with ❤️ by the AgentHazard Team
