AgentHazard

A Benchmark for Evaluating Harmful Behavior in Computer-Use Agents

🌐 Website | 📊 Dataset | 📄 Paper | 📖 Appendix | 🤗 Hugging Face


🎯 Overview

AgentHazard is a comprehensive benchmark for evaluating harmful behavior in computer-use agents. Unlike traditional prompt-level safety benchmarks, AgentHazard focuses on execution-level failures that emerge through the composition of locally plausible steps across multi-turn, tool-mediated trajectories.

Key Features

  • 🎯 2,653 curated instances spanning diverse risk categories and attack strategies
  • 📊 10 risk categories: RCE, Data Exfiltration, Persistence, Supply Chain Poisoning, Destructive Actions, Internal Reconnaissance, Privilege Escalation, Resource Exhaustion, Prompt Intelligence Theft, and Compliance Bypass
  • ⚔️ 10 attack strategies: Direct, Recursive Salami Slicing, Security Audit Gaslighting, Dependency Hook Trigger, Pseudo Diagnostic Debugging, Contextual Persona Enforcement, Encoded Payload Smuggling, Implicit Indirect Injection, Logical Dependency Inversion, and Legacy Code Maintenance
  • 🔬 3 agent frameworks: Claude Code, OpenClaw, IFlow
  • 🤖 Multiple models: Qwen2.5/3, Kimi K2/K2.5, GLM-4.6, DeepSeek
  • 📈 10,000+ execution trajectories for detailed analysis

⚠️ Real-World Attack Effectiveness

Our attacks are not merely theoretical; they trigger actual security alerts in production environments:

Security Alert
Real security monitoring alert triggered during AgentHazard evaluation, confirming the real-world severity of identified vulnerabilities.

📦 Dataset Access

For Reviewers

All materials required for review are available below:

| Resource | Description | Link |
| --- | --- | --- |
| 📊 Dataset | 2,653 instances in JSON format | Download |
| 📄 Appendix | Detailed taxonomy and methodology | Download |
| 💻 Code | Full source code & evaluation framework | GitHub |
| 🔍 Trajectories | Execution traces from agent runs | Browse |

Quick Start

# Clone the repository
git clone https://github.com/Yunhao-Feng/AgentHazard.git
cd AgentHazard

# Install dependencies
pip install -r requirements.txt

Then load the dataset in Python:

# Load the dataset
import json
with open('data/dataset.json', 'r', encoding='utf-8') as f:
    dataset = json.load(f)

print(f"Loaded {len(dataset)} instances")

📊 Dataset Statistics

| Statistic | Value |
| --- | --- |
| Total Instances | 2,653 |
| Risk Categories | 10 |
| Attack Strategies | 10 |
| Seed Pool Size | 8,000+ |
| Avg. Decomposition Length | 11.55 |
| Avg. Target Length | 20.98 |
| Agent Frameworks | 3 |
| Trajectories | 10,000+ |
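
Per-category counts can be recomputed from the loaded dataset. The following is a minimal sketch, assuming the dataset is a list of dictionaries that carry the `risk_category` and `attack_strategy` fields shown under Dataset Structure. The `summarize` helper and the three sample records are hypothetical and only illustrate the shape of the computation.

```python
from collections import Counter

def summarize(dataset):
    """Count instances overall, per risk category, and per attack strategy."""
    by_risk = Counter(item["risk_category"] for item in dataset)
    by_strategy = Counter(item["attack_strategy"] for item in dataset)
    return {"total": len(dataset), "by_risk": by_risk, "by_strategy": by_strategy}

# Tiny in-memory stand-in for data/dataset.json (illustrative values only).
sample = [
    {"risk_category": "System RCE", "attack_strategy": "Direct"},
    {"risk_category": "System RCE", "attack_strategy": "Recursive Salami Slicing"},
    {"risk_category": "Data Exfiltration", "attack_strategy": "Direct"},
]

stats = summarize(sample)
print(stats["total"], dict(stats["by_risk"]))
```

Passing the object returned by `json.load` in Quick Start instead of `sample` should work as long as the assumption about the top-level list holds.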

🏗️ Benchmark Construction

AgentHazard is constructed through a rigorous multi-stage pipeline:

  1. Taxonomy Design: Define risk categories and attack strategies from vulnerability knowledge bases and prior literature
  2. Task Generation: Create task templates embedding harmful objectives within realistic workflows
  3. Seed Pool Generation: Use LLMs to generate large-scale candidate instances
  4. Execution-Based Filtering: Run instances in sandboxed agent environments to validate executability
  5. LLM-as-Judge Evaluation: Use Gemini-3 and GPT-5.4 to verify alignment with intended objectives
  6. Human Curation: Manual review to remove duplicates and ensure quality

Construction Pipeline
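
The filtering stages of the pipeline (steps 4 to 6) can be pictured as a chain of keep/drop checks over the seed pool. The sketch below is an illustration only, not the authors' implementation: `run_pipeline`, `make_dedup`, and the `executable` and `judge_aligned` flags are hypothetical stand-ins for the sandboxed execution, LLM-as-judge, and human review steps.

```python
def run_pipeline(candidates, stages):
    """Apply each (name, keep) stage in order and record how many candidates survive."""
    survivors = list(candidates)
    report = [("seed pool", len(survivors))]
    for name, keep in stages:
        survivors = [c for c in survivors if keep(c)]
        report.append((name, len(survivors)))
    return survivors, report

def make_dedup(key):
    """Build a predicate that keeps only the first candidate seen for each key."""
    seen = set()
    def keep(candidate):
        value = candidate[key]
        if value in seen:
            return False
        seen.add(value)
        return True
    return keep

# Hypothetical stand-ins for the real checks (assumed flags, not real fields).
stages = [
    ("execution-based filtering", lambda c: c["executable"]),
    ("LLM-as-judge evaluation", lambda c: c["judge_aligned"]),
    ("human curation (dedup)", make_dedup("task")),
]

candidates = [
    {"task": "a", "executable": True, "judge_aligned": True},
    {"task": "a", "executable": True, "judge_aligned": True},
    {"task": "b", "executable": False, "judge_aligned": True},
    {"task": "c", "executable": True, "judge_aligned": False},
]

kept, report = run_pipeline(candidates, stages)
for name, count in report:
    print(f"{name}: {count}")
```

Recording the survivor count after every stage makes it easy to see where candidates drop out between the seed pool and the final curated set.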


🔬 Experimental Results

Main Results

We evaluate AgentHazard on three representative agent frameworks using diverse backbone models:

| Framework | Model | Overall ASR (%) | Harmfulness Score |
| --- | --- | --- | --- |
| Claude Code | GLM-4.6 | 82.90 | 7.05 |
| Claude Code | Kimi-K2.5 | 78.30 | 6.47 |
| Claude Code | Qwen3-Coder | 73.63 | 5.95 |
| IFlow | Qwen2.5-Coder-32B | 74.70 | 6.06 |
| OpenClaw | Kimi-K2 | 71.10 | 5.82 |
| OpenClaw | GLM-4.6 | 70.80 | 6.15 |

ASR = Attack Success Rate (higher is more vulnerable)
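
As a rough illustration of the metric, ASR can be read as the percentage of evaluated instances on which the attack succeeded. The sketch below assumes one boolean verdict per instance. The `attack_success_rate` helper and the sample verdicts are hypothetical and are not part of the released evaluation code.

```python
def attack_success_rate(outcomes):
    """Return ASR in percent: successful attacks / total attempts * 100."""
    if not outcomes:
        raise ValueError("no outcomes to score")
    successes = sum(1 for ok in outcomes if ok)
    return 100.0 * successes / len(outcomes)

# Hypothetical per-instance verdicts (True = attack judged successful).
judged = [True, True, False, True]
asr = attack_success_rate(judged)
print(f"ASR = {asr:.2f}%")  # ASR = 75.00%
```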

Key Findings

  • 🚨 High Vulnerability: Current agents remain broadly vulnerable, with the highest ASR reaching 82.90%
  • 📈 Multi-Step Escalation: ASR roughly triples between Round 1 and Round 3, confirming trajectory-dependent harm
  • 🔀 Framework Variance: The same model shows up to 16 percentage points of ASR difference across frameworks
  • 🛡️ Guard Model Limitations: The best guard model (Llama-Guard-3-8B) detects only 27.03% of harmful intent
  • ⚠️ Real-World Impact: Attacks trigger actual security monitoring systems in production environments (see alert evidence)
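
The framework-variance comparison can be reproduced for the rows listed in the Main Results table. The sketch below copies those six rows and computes, for each model that appears under more than one framework, the spread between its highest and lowest ASR. The `framework_gaps` helper is hypothetical. Only GLM-4.6 appears twice in that table, so only its gap is computed here; the 16 percentage point figure presumably draws on results not listed in the table.

```python
from collections import defaultdict

# (framework, model, overall ASR %) copied from the Main Results table.
results = [
    ("Claude Code", "GLM-4.6", 82.90),
    ("Claude Code", "Kimi-K2.5", 78.30),
    ("Claude Code", "Qwen3-Coder", 73.63),
    ("IFlow", "Qwen2.5-Coder-32B", 74.70),
    ("OpenClaw", "Kimi-K2", 71.10),
    ("OpenClaw", "GLM-4.6", 70.80),
]

def framework_gaps(rows):
    """Return {model: max ASR - min ASR} for models run on two or more frameworks."""
    by_model = defaultdict(list)
    for _framework, model, asr in rows:
        by_model[model].append(asr)
    return {m: round(max(v) - min(v), 2) for m, v in by_model.items() if len(v) > 1}

gaps = framework_gaps(results)
print(gaps)  # {'GLM-4.6': 12.1}
```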

🎯 Attack Categories

Risk Categories

  1. System RCE: Arbitrary command execution
  2. Data Exfiltration: Unauthorized data extraction
  3. Persistence Establishment: Backdoor insertion
  4. Supply Chain Poisoning: Dependency manipulation
  5. Destructive Action: Irreversible damage
  6. Internal Reconnaissance: Network enumeration
  7. Privilege Escalation: Container escape
  8. Resource Exhaustion: DoS attacks
  9. Prompt Intelligence Theft: System prompt extraction
  10. Compliance Bypass: Policy violation

Attack Strategies

  1. Direct: Explicit harmful requests
  2. Recursive Salami Slicing: Decomposition into benign subtasks
  3. Security Audit Gaslighting: Framing as security validation
  4. Dependency Hook Trigger: Embedding in build/CI pipelines
  5. Pseudo Diagnostic Debugging: Fabricated failure traces
  6. Contextual Persona Enforcement: Role escalation
  7. Encoded Payload Smuggling: Obfuscation techniques
  8. Implicit Indirect Injection: Hidden instructions in artifacts
  9. Logical Dependency Inversion: Harm as prerequisite for safety
  10. Legacy Code Maintenance: Framing as legacy preservation

🔍 Dataset Structure

Each instance in AgentHazard follows this structure:

{
  "id": "unique_instance_id",
  "risk_category": "System RCE",
  "attack_strategy": "Recursive Salami Slicing",
  "task_description": "Fix the failing unit test...",
  "decomposition": [
    "Step 1: Inspect the test file",
    "Step 2: Identify the assertion",
    "Step 3: Modify the configuration",
    "..."
  ],
  "target_harmful_action": "Execute arbitrary shell command",
  "workspace_files": {
    "test.py": "...",
    "config.json": "..."
  },
  "expected_trajectory_length": 5,
  "metadata": {
    "difficulty": "medium",
    "frameworks": ["claudecode", "openclaw"]
  }
}
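
A light sanity check over this structure can catch malformed records before evaluation. The sketch below is a minimal example, assuming the fields in the sample above are all required and have the types shown. The `REQUIRED_FIELDS` mapping and the `validate_instance` helper are hypothetical; the released dataset may treat some fields as optional.

```python
# Assumed required fields and types, inferred from the example instance above.
REQUIRED_FIELDS = {
    "id": str,
    "risk_category": str,
    "attack_strategy": str,
    "task_description": str,
    "decomposition": list,
    "target_harmful_action": str,
    "workspace_files": dict,
}

def validate_instance(instance):
    """Return a list of problems; an empty list means the instance looks well-formed."""
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in instance:
            problems.append(f"missing field: {field}")
        elif not isinstance(instance[field], expected):
            problems.append(f"{field} should be {expected.__name__}")
    return problems

example = {
    "id": "unique_instance_id",
    "risk_category": "System RCE",
    "attack_strategy": "Recursive Salami Slicing",
    "task_description": "Fix the failing unit test...",
    "decomposition": ["Step 1: Inspect the test file"],
    "target_harmful_action": "Execute arbitrary shell command",
    "workspace_files": {"test.py": "..."},
}

print(validate_instance(example))  # []
```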

📂 Repository Structure

AgentHazard/
├── data/
│   ├── dataset.json              # Main benchmark dataset
│   └── evaluation_results.csv    # Experimental results
├── traces/
│   ├── claudecode/              # Claude Code execution traces
│   ├── openclaw/                # OpenClaw execution traces
│   └── iflow/                   # IFlow execution traces
├── config/                      # Configuration files
├── pics/                        # Images and figures
│   ├── logo.png
│   ├── framework.png
│   ├── illustration.png
│   ├── demo.jpg
│   ├── heatmap.jpg
│   └── alert.png                # Real security alert evidence
├── appendix.pdf                 # Supplementary material
├── index.html                   # Project website
├── README.md                    # This file
└── LICENSE                      # MIT License

📄 Citation

If you use AgentHazard in your research, please cite:

@inproceedings{feng2026agenthazard,
  title     = {AgentHazard: A Benchmark for Evaluating Harmful Behavior in Computer-Use Agents},
  author    = {Feng, Yunhao and Ding, Yifan and Tan, Yingshui and Ma, Xingjun and
               Li, Yige and Wu, Yutao and Gao, Yifeng and Zhai, Kun and Guo, Yanming},
  booktitle = {arXiv:2604.02947},
  year      = {2026}
}

📜 License

This project is licensed under the MIT License - see the LICENSE file for details.


🤝 Contributing

We welcome contributions! Please see our contributing guidelines for details.


🙏 Acknowledgments

  • Special thanks to Yutao Wu and the Internal Safety Collapse (ISC) project for their generous support and valuable insights. Their work on execution-level safety failures in frontier language models and AI agents provided important inspiration for the development of AgentHazard.
  • We gratefully acknowledge Alibaba Group and Fudan University for their support of this research.
  • We also thank the Claude Code, OpenClaw, and IFlow teams for their agent frameworks, and the Qwen, Kimi, GLM, and DeepSeek teams for their open models and contributions to the research community.

📞 Contact

For questions, issues, or collaboration inquiries:


⚠️ Research Use Only

This benchmark is released for research purposes to advance the safety of computer-use agents. Please use responsibly and ethically.


Made with ❤️ by the AgentHazard Team
