AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks

Blog

Installation

pip install vllm autogen pandas retry openai
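To check that the dependencies installed correctly, you can try importing them (a quick sanity check; this assumes each package above is importable under the name shown):

python -c "import vllm, autogen, pandas, retry, openai; print('dependencies OK')"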

Prepare Inference Service Using vLLM

vLLM provides an OpenAI-compatible API server with efficient inference and built-in load balancing across multiple GPUs.

Start vLLM Server

Start the vLLM server with your desired model. For multi-GPU setups, use --data-parallel-size to enable automatic load balancing:

Single GPU:

vllm serve Qwen/Qwen3-1.7B --port 8000

Multiple GPUs (e.g., 2 GPUs with data parallelism):

vllm serve Qwen/Qwen3-1.7B --port 8000 --data-parallel-size 2

With tensor parallelism for larger models:

vllm serve <your-large-model> --port 8000 --tensor-parallel-size 4

Combined tensor and data parallelism (8 GPUs, 2-way TP × 4-way DP):

vllm serve <your-large-model> --port 8000 --tensor-parallel-size 2 --data-parallel-size 4

For more details on data parallel deployment with internal load balancing, see the vLLM documentation.

Verify the Server

You can verify the server is running by checking the models endpoint:

curl http://localhost:8000/v1/models
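You can also send a test chat completion through the OpenAI-compatible endpoint (a quick smoke test; the model name must match the one passed to vllm serve):

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen3-1.7B", "messages": [{"role": "user", "content": "Hello"}]}'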

Response Generation

The responses are generated by the target model served by vLLM (default: Qwen/Qwen3-1.7B). Make sure your vLLM server is running before executing the following command.

Attack Prompts (Harmful)

python attack/attack.py --model Qwen/Qwen3-1.7B --host 127.0.0.1 --port 8000

This command will generate responses using an attack prompt template (default: --template v1) loaded from data/prompt/attack_prompt_template.json. To run multiple repetitions, invoke the script multiple times and vary --output-suffix and/or --cache-seed.
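For example, three repetitions could be run with a small shell loop (a sketch; the suffix format and seed values here are arbitrary illustrations, not prescribed by the script):

for i in 0 1 2; do
  python attack/attack.py \
    --model Qwen/Qwen3-1.7B \
    --host 127.0.0.1 --port 8000 \
    --output-suffix "_rep${i}" \
    --cache-seed "${i}"
done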

Safe Prompts (Benign)

To generate responses for safe/benign prompts (used for false positive evaluation):

python attack/attack.py \
    --model Qwen/Qwen3-1.7B \
    --template placeholder \
    --prompts data/prompt/safe_prompts.json \
    --output-prefix safe

The placeholder template passes prompts through without any attack framing, while v1 wraps prompts with jailbreak instructions.
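To inspect the available templates yourself, you can pretty-print the template file (read-only; this makes no assumption about its schema):

python -m json.tool data/prompt/attack_prompt_template.json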

Run Defense Experiments

The following command runs the 1-Agent, 2-Agent, and 3-Agent defense experiments. The --chat-file argument should point to the harmful outputs generated by attack/attack.py (by default saved under data/harmful_output/<model_dir>/, e.g. data/harmful_output/Qwen-Qwen3-1.7B/attack-dan_0.json).

export AUTOGEN_USE_DOCKER=0

python defense/run_defense_exp.py \
  --model Qwen/Qwen3-1.7B \
  --chat-file data/harmful_output/Qwen-Qwen3-1.7B/attack-dan_0.json
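If you generated several attack output files (e.g. from repeated runs with different --output-suffix values), you can run the defense over each one with a shell loop (a sketch; reusing the file name as the output suffix is only an illustrative convention):

for f in data/harmful_output/Qwen-Qwen3-1.7B/attack-dan_*.json; do
  python defense/run_defense_exp.py \
    --model Qwen/Qwen3-1.7B \
    --chat-file "${f}" \
    --output-suffix "$(basename "${f}" .json)"
done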

Command Line Arguments

--model               Target model served by vLLM (default: Qwen/Qwen3-1.7B)
--chat-file           Path to the chat file with harmful outputs (required)
--port                Port where the vLLM server is running (default: 8000)
--host                Hostname of the vLLM server (default: 127.0.0.1)
--output-dir          Output directory (default: data/defense_output/<model_dir>)
--output-suffix       Suffix for the output directory (default: "")
--strategies          Defense strategies to run (default: ex-2 ex-3 ex-cot)
--workers             Number of parallel workers (default: 128)
--frequency_penalty   Frequency penalty for generation (default: 0.0)
--presence_penalty    Presence penalty for generation (default: 0.0)
--temperature         Temperature for generation (default: 0.7)
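For example, to run only the 3-Agent configuration with deterministic decoding and fewer workers (a sketch using only the flags listed above; ex-3 is taken from the default strategy list, and passing a single strategy is assumed to be supported):

python defense/run_defense_exp.py \
  --model Qwen/Qwen3-1.7B \
  --chat-file data/harmful_output/Qwen-Qwen3-1.7B/attack-dan_0.json \
  --strategies ex-3 \
  --temperature 0.0 \
  --workers 32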

After the defense experiment finishes, the output appears in data/defense_output/<model_dir>/ (e.g. data/defense_output/Qwen-Qwen3-1.7B/).

GPT Evaluation (paper uses GPT-4)

Evaluating harmful output defense:

python evaluator/gpt4_evaluator.py \
  --defense_output_dir data/defense_output/Qwen-Qwen3-1.7B \
  --ori_prompt_file_name prompt_dan.json

After the evaluation finishes, the results will appear in data/defense_output/Qwen-Qwen3-1.7B/asr.csv, and a score value will also be added for each defense output in the output JSON file. evaluator/gpt4_evaluator.py uses a GPT model as the evaluator (the original paper uses GPT-4). Set your OpenAI credentials via environment variables (or CLI flags); you can swap the evaluator to a newer GPT model (e.g., GPT-5) via --model.

export OPENAI_API_KEY=...
# optional (only if you use an OpenAI-compatible endpoint):
# export OPENAI_BASE_URL=...

python evaluator/gpt4_evaluator.py \
  --defense_output_dir data/defense_output/Qwen-Qwen3-1.7B \
  --ori_prompt_file_name prompt_dan.json \
  --model gpt-4-1106-preview

GPT-based evaluation can be costly; we enable caching to avoid repeated evaluation.

For safe response evaluation, there is a more efficient method that does not require GPT-4. If you know that all the prompts in your dataset are regular user prompts and should not be rejected, you can use the following command to evaluate the false positive rate (FPR) of the defense output.

python evaluator/evaluate_safe.py

This will find all output folders in data/defense_output that contain the keyword -safe and evaluate the false positive rate (FPR). The FPR will be saved in the data/defense_output/defense_fp.csv file.
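You can inspect the resulting CSV with pandas (installed above); this simply prints the table and makes no assumption about its column layout:

python -c "import pandas as pd; print(pd.read_csv('data/defense_output/defense_fp.csv'))"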
