Skip to content

aifinlab/Spider-Sense

Repository files navigation

Spider-Sense

Intrinsic Risk Sensing for Efficient Agent Defense with Hierarchical Adaptive Screening

An event-driven defense framework allowing agents to maintain latent vigilance and trigger defenses only upon risk perception.


📖 Abstract

As large language models (LLMs) evolve into autonomous agents, most existing defense mechanisms adopt a mandatory checking paradigm, which forcibly triggers security validation at predefined stages regardless of actual risk. This approach leads to high latency and computational redundancy.

We propose Spider-Sense, an event-driven defense framework based on Intrinsic Risk Sensing (IRS). It allows agents to maintain latent vigilance and trigger defenses only upon risk perception. Once triggered, Spider-Sense invokes a Hierarchical Adaptive Screening (HAS) mechanism that trades off efficiency and precision: resolving known patterns via lightweight similarity matching while escalating ambiguous cases to deep internal reasoning.


⚡ Framework Comparison

The Problem: Mandatory Checking

Existing frameworks rely on forced, repetitive external security checks at every stage (Plan, Action, Observation), leading to rapidly accumulating latency and disrupting normal user interaction.

Framework Comparison

The Solution: Intrinsic Risk Sensing (IRS)

Spider-Sense utilizes proactive, endogenous risk awareness to dynamically trigger targeted analysis only when anomalies are sensed, significantly reducing overhead.


🔬 Method Overview

Framework Overview

The framework operates on a Detect-Audit-Respond cycle:

  1. Intrinsic Risk Sensing (IRS): The agent maintains a latent state of vigilance. It continuously monitors artifacts across four stages (Query, Plan, Action, Observation).
  2. Sensing Indicator: Upon perceiving a risk, the agent generates a specific indicator (e.g., <|verify_user_intent|>), pausing execution.
  3. Hierarchical Adaptive Screening (HAS):
    • Coarse-grained Detection: Fast vector matching against a database of known attack patterns.
    • Fine-grained Analysis: Deep reasoning by an LLM for ambiguous or low-similarity cases.
  4. Autonomous Decision: The agent decides to Resume execution (if safe) or Refuse/Sanitize (if unsafe).

🛡️ Defense Modules

Spider-Sense protects four security-critical stages using specialized defense tags:

Stage Module Tag Function Trigger Condition
Query <|verify_user_intent|> Agent Logic Hijacking When user input attempts to jailbreak or override instructions.
Plan <|validate_memory_plan|> Thought-Process Manipulation When reasoning or retrieved memories show signs of poisoning.
Action <|audit_action_parameters|> Tool-Use Exploitation Before executing high-risk tools with suspicious parameters.
Observation <|sanitize_observation|> Indirect Prompt Injection Upon receiving tool outputs containing hidden malicious commands.

📊 S2Bench

To facilitate rigorous evaluation, we introduce S2Bench, a lifecycle-aware benchmark featuring:

  • Multi-stage Attacks: Covers Query, Plan, Action, and Observation stages.
  • Realistic Tool Execution: Involves actual tool selection and parameter generation (approx. 300 functions).
  • Hard Benign Prompts: Includes 153 carefully constructed benign samples to test for over-defense (False Positives).

📈 Performance Results

Spider-Sense achieves state-of-the-art defense performance with minimal latency on S2Bench, Mind2Web, and eICU.

Table 2
Table 3
  • Lowest Attack Success Rate (ASR): Effectively blocks sophisticated multi-stage attacks.
  • Lowest False Positive Rate (FPR): Distinguishes subtle benign intents from attacks.
  • Marginal Latency Overhead: Only ~8.3%, compared to >200% for some mandatory checking baselines.

Ablation Studies

Ablation Study 1 Ablation Study 2
  • IRS Importance: Removing sensing at any stage leads to significant ASR proliferation, especially at the Action stage.
  • HAS Balance: Combining coarse detection and fine analysis yields the best trade-off between safety and efficiency.

🔍 Case Study

Scenario: A clinical analysis agent retrieving patient records via a utility tool.

Case Study
  1. Attack: The tool return content is poisoned with injected code import fake_module to induce unauthorized execution.
  2. Detection: The agent's IRS activates the sensing indicator <|sanitize_observation|>.
  3. HAS Screening: The content is routed to the inspector. It is identified as contextually unjustified.
  4. Response: The agent autonomously terminates execution, intercepting the attack.

🚀 Quick Start

1. Installation

git clone https://github.com/your-repo/SpiderSense.git
cd SpiderSense
pip install -r requirements.txt

2. Configuration

Set up your API keys. We recommend using OpenRouter for broad model support.

# Create .env file
OPENROUTER_API_KEY=sk-or-v1-xxxxxxxxxxxxxxxx

3. Run Defense Test

Enable Intrinsic Risk Sensing (IRS) using the defense template:

python main_attacker.py \
    --llm_name qwen-max \
    --defense_template template/spider_template.txt \
    --template_position system \
    --res_file results/spider_sense_test.csv


📂 Project Structure

SpiderSense/
├── config/                         # Experiment configs
│   ├── Defense_2.yml               # Default SpiderSense config
│   ├── DPI.yml                     # Direct Prompt Injection config
│   ├── OPI.yml                     # Indirect Prompt Injection config
│   └── MP.yml                      # Memory Poisoning config
├── data/                           # Benchmarks & Attack Datasets
├── pyopenagi/
│   └── agents/                     # Agent implementations (inc. sandbox.py)
├── template/
│   ├── spider_template.txt         # Core Defense-y1 Protocol
│   └── sandbox_judge_*.txt         # Judge prompts for each stage
├── main_attacker.py                # Entry point for single-case testing
└── scripts/                        # Batch execution scripts

🔧 Configuration

SpiderSense uses YAML files in the config/ directory to manage experiment settings.

Config File Description
Defense_2.yml Standard configuration enabling IRS and all defense modules.
DPI.yml Specialized for testingDirect Prompt Injection attacks.
OPI.yml Specialized forIndirect Prompt Injection (Observation stage).
MP.yml Specialized forMemory Poisoning (Retrieval stage).
DPI_MP.yml CombinedDPI + Memory Poisoning attack.
DPI_OPI.yml CombinedDPI + Indirect Prompt Injection attack.
OPI_MP.yml CombinedOPI + Memory Poisoning attack.
Tool_Injection.yml Tool Description Injection: Manipulating tool definitions to mislead agents.
Adv_Tools.yml Adversarial Tools: Injecting decoy tools with confusing names.
Lies_Loop.yml Lies in the Loop: Forcing agents to deceive simulated human approvers.
Logic_Backdoor.yml Reasoning Backdoor: Triggering logical fallacies or suspended safety protocols.

To use a specific config:

python scripts/agent_attack.py --cfg_path config/DPI.yml

🚀 Advanced Usage

For large-scale benchmarking, use the scripts in the scripts/ directory.

Run Serial Benchmark (Stage 1)

Executes attacks sequentially. Useful for debugging or smaller models.

python scripts/run_stage_1_serial.py \
    --llm_name gpt-4o-mini \
    --num_attack 10 \
    --num_fp 10

Run Parallel Benchmark (Stage 4)

Executes attacks in parallel to speed up evaluation.

python scripts/run_stage_4_parallel.py \
    --llm_name qwen-max \
    --num_attack 50

📖 Citation

If you use Spider-Sense or S2Bench in your research, please cite our paper:

@misc{yu2026spidersenseintrinsicrisksensing,
      title={Spider-Sense: Intrinsic Risk Sensing for Efficient Agent Defense with Hierarchical Adaptive Screening}, 
      author={Zhenxiong Yu and Zhi Yang and Zhiheng Jin and Shuhe Wang and Heng Zhang and Yanlin Fei and Lingfeng Zeng and Fangqi Lou and Shuo Zhang and Tu Hu and Jingping Liu and Rongze Chen and Xingyu Zhu and Kunyi Wang and Chaofa Yuan and Xin Guo and Zhaowei Liu and Feipeng Zhang and Jie Huang and Huacan Wang and Ronghao Chen and Liwen Zhang},
      year={2026},
      eprint={2602.05386},
      archivePrefix={arXiv},
      primaryClass={cs.CR},
      url={https://arxiv.org/abs/2602.05386}, 
}

🙏 Acknowledgments

This work was supported by the AIFin Lab and the QuantaAlpha.

About AIFin Lab

Initiated by Professor Liwen Zhang from Shanghai University of Finance and Economics (SUFE), AIFin Lab is deeply rooted in the interdisciplinary fields of AI + Finance, Statistics, and Data Science. The team brings together cutting-edge scholars from top institutions such as SUFE, FDU, SEU, CMU, and CUHK. We are dedicated to building a comprehensive "full-link" system covering data, models, benchmarks, and intelligent prompting. We are actively looking for talented students (UG/Master/PhD) and researchers worldwide who are passionate about AI Agent security and financial intelligence to join AIFin Lab!

If you are interested in contributing to this project or exploring research collaborations, please send your CV/introduction to: 📩 aifinlab.sufe@gmail.com and CC to: 📧 zhang.liwen@shufe.edu.cn

We look forward to hearing from you!

About QuantaAlpha

Founded in April 2025, QuantaAlpha is composed of professors, postdoctoral researchers, PhDs, and Master's students from prestigious universities including THU, PKU, CAS, CMU, and HKUST. Our mission is to explore the "Quanta" of intelligence and lead the "Alpha" frontier of agent research—ranging from CodeAgents and self-evolving intelligence to financial and cross-domain specialized agents—dedicated to reshaping the boundaries of artificial intelligence.

In 2026, we will continue to produce high-quality research results in areas such as CodeAgent (end-to-end autonomous execution of real-world tasks), DeepResearch, Agentic Reasoning/Agentic RL, and Self-Evolution & Collaborative Learning. We welcome students interested in our research directions to join us!
Team Homepage: https://quantaalpha.github.io/

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors