ShiyuXiang77/EDDF

Beyond Surface-Level Patterns: An Essence-Driven Defense Framework Against Jailbreak Attacks in LLMs

[Figure: method]

[Figure: case]

Abstract

Although Aligned Large Language Models (LLMs) are trained to reject harmful requests, they remain vulnerable to jailbreak attacks. Unfortunately, existing methods often focus on surface-level patterns, overlooking the deeper attack essences. As a result, defenses fail when attack prompts change, even though the underlying "attack essences" remain the same. To address this issue, we introduce EDDF, an Essence-Driven Defense Framework Against Jailbreak Attacks in LLMs. EDDF is a plug-and-play input-filtering method and operates in two stages: 1) offline essence database construction, and 2) online adversarial query detection. The key idea behind EDDF is to extract the "attack essence" from a diverse set of known attack instances and store it in an offline vector database. Experimental results demonstrate that EDDF significantly outperforms existing methods by reducing the Attack Success Rate by at least 20%, underscoring its superior robustness against jailbreak attacks.
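The two-stage idea described in the abstract can be illustrated with a small, self-contained sketch. This is not the code in this repository: the class `EssenceStore`, the functions `embed`, `cosine`, and `is_suspicious`, the bag-of-words embedding, and the `threshold` value are all hypothetical names and choices made for illustration. The sketch also assumes that essence strings have already been extracted (the extraction step itself is out of scope here) and that online detection compares a query's essence against stored essences by vector similarity.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; a stand-in (assumption) for a real sentence encoder."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class EssenceStore:
    """Stage 1 (offline): keep essences of known attacks as vectors (hypothetical class)."""

    def __init__(self):
        self._entries = []  # list of (essence text, vector) pairs

    def add(self, essence):
        self._entries.append((essence, embed(essence)))

    def nearest(self, query_essence):
        """Return the most similar stored essence and its similarity score."""
        query_vec = embed(query_essence)
        best_text, best_score = None, 0.0
        for text, vec in self._entries:
            score = cosine(query_vec, vec)
            if score > best_score:
                best_text, best_score = text, score
        return best_text, best_score


def is_suspicious(store, query_essence, threshold=0.5):
    """Stage 2 (online): flag a query whose essence is close to a known attack essence.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    _, score = store.nearest(query_essence)
    return score >= threshold
```

A possible usage: add a few essence strings to an `EssenceStore`, then call `is_suspicious(store, essence_of_new_query)` before forwarding the query to the model. Because the comparison is made on essences rather than raw prompts, a reworded attack can still land near a stored entry, which is the intuition the abstract describes.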

Quick Start

  1. Configure the relevant parameters in config.py and install the dependencies:
pip install -r requirements.txt
  2. Offline Essence Database Construction: In the main function, set folder_path and error_path for the data you want to use (for example, the data in the essence folder). Then run the three scripts below in order: they extract the attack essences, judge them, and store the results in the vector database.
cd EDDF
python offline_essense_extraction.py
python offine_essense_judge.py
python vectorstore.py
  3. Online Adversarial Query Detection: In the main function of online_main.py, set folder_path and error_path for the data to be detected, then run the script:
python online_main.py
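The commands above have to run in a fixed order, so it can be convenient to wrap them in a small driver that stops at the first failure. The following is a minimal sketch and is not part of the repository: the helper names `build_commands` and `run_stage` are hypothetical, the script names are copied verbatim from the steps above, and the working directory `EDDF` is assumed from the `cd EDDF` step. It does not set folder_path or error_path; those still need to be edited in each script's main function first.

```python
import subprocess
import sys

# Script names copied verbatim from the Quick Start steps.
OFFLINE_STEPS = [
    "offline_essense_extraction.py",
    "offine_essense_judge.py",
    "vectorstore.py",
]
ONLINE_STEPS = ["online_main.py"]


def build_commands(stage, python=sys.executable):
    """Return the ordered command lines for 'offline' or 'online' (hypothetical helper)."""
    steps = {"offline": OFFLINE_STEPS, "online": ONLINE_STEPS}[stage]
    return [[python, script] for script in steps]


def run_stage(stage, cwd="EDDF", runner=subprocess.run):
    """Run each script of a stage in order inside cwd.

    check=True makes subprocess.run raise CalledProcessError on a non-zero
    exit code, so later scripts do not run on top of a failed earlier step.
    """
    for cmd in build_commands(stage):
        runner(cmd, cwd=cwd, check=True)
```

For example, `run_stage("offline")` followed by `run_stage("online")` would reproduce the sequence of steps 2 and 3. The `runner` parameter exists only so the ordering can be checked without executing the scripts.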

Contact

BibTeX:

@article{xiang2025beyond,
  title={Beyond Surface-Level Patterns: An Essence-Driven Defense Framework Against Jailbreak Attacks in LLMs},
  author={Xiang, Shiyu and Zhang, Ansen and Cao, Yanfei and Fan, Yang and Chen, Ronghao},
  journal={arXiv preprint arXiv:2502.19041},
  year={2025}
}

About

[ACL 2025 Findings] The official code for "Beyond Surface-Level Patterns: An Essence-Driven Defense Framework Against Jailbreak Attacks in LLMs".
