AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks

Blog

Installation

pip install vllm autogen pandas retry openai
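To check that the dependencies installed correctly, you can try importing them (a quick sanity check; this assumes each package above is importable under the name shown):

python -c "import vllm, autogen, pandas, retry, openai; print('dependencies OK')"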

Prepare Inference Service Using vLLM

vLLM provides an OpenAI-compatible API server with efficient inference and built-in load balancing across multiple GPUs.

Start vLLM Server

Start the vLLM server with your desired model. For multi-GPU setups, use --data-parallel-size to enable automatic load balancing:

Single GPU:

vllm serve Qwen/Qwen3-1.7B --port 8000

Multiple GPUs (e.g., 2 GPUs with data parallelism):

vllm serve Qwen/Qwen3-1.7B --port 8000 --data-parallel-size 2

With tensor parallelism for larger models:

vllm serve <your-large-model> --port 8000 --tensor-parallel-size 4

Combined tensor and data parallelism (8 GPUs, 2-way TP × 4-way DP):

vllm serve <your-large-model> --port 8000 --tensor-parallel-size 2 --data-parallel-size 4

For more details on data parallel deployment with internal load balancing, see the vLLM documentation.

Verify the Server

You can verify the server is running by checking the models endpoint:

curl http://localhost:8000/v1/models
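You can also send a test chat completion through the OpenAI-compatible endpoint (a quick smoke test; the model name must match the one passed to vllm serve):

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen3-1.7B", "messages": [{"role": "user", "content": "Hello"}]}'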

Response Generation

The responses are generated by the target model served by vLLM (default: Qwen/Qwen3-1.7B). Make sure your vLLM server is running before executing the following command.

Attack Prompts (Harmful)

python attack/attack.py --model Qwen/Qwen3-1.7B --host 127.0.0.1 --port 8000

This command will generate responses using an attack prompt template (default: --template v1) loaded from data/prompt/attack_prompt_template.json. To run multiple repetitions, invoke the script multiple times and vary --output-suffix and/or --cache-seed.
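For example, three repetitions could be run with a small shell loop (a sketch; the suffix format and seed values here are arbitrary illustrations, not prescribed by the script):

for i in 0 1 2; do
  python attack/attack.py \
    --model Qwen/Qwen3-1.7B \
    --host 127.0.0.1 --port 8000 \
    --output-suffix "_rep${i}" \
    --cache-seed "${i}"
done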

Safe Prompts (Benign)

To generate responses for safe/benign prompts (used for false positive evaluation):

python attack/attack.py \
    --model Qwen/Qwen3-1.7B \
    --template placeholder \
    --prompts data/prompt/safe_prompts.json \
    --output-prefix safe

The placeholder template passes prompts through without any attack framing, while v1 wraps prompts with jailbreak instructions.
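To inspect the available templates yourself, you can pretty-print the template file (read-only; this makes no assumption about its schema):

python -m json.tool data/prompt/attack_prompt_template.json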

Run Defense Experiments

The following command runs the 1-Agent, 2-Agent, and 3-Agent defense experiments. The --chat-file argument should point to the harmful outputs generated by attack/attack.py (by default saved under data/harmful_output/<model_dir>/, e.g. data/harmful_output/Qwen-Qwen3-1.7B/attack-dan_0.json).

export AUTOGEN_USE_DOCKER=0

python defense/run_defense_exp.py \
  --model Qwen/Qwen3-1.7B \
  --chat-file data/harmful_output/Qwen-Qwen3-1.7B/attack-dan_0.json
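If you generated several attack output files (e.g. from repeated runs with different --output-suffix values), you can run the defense over each one with a shell loop (a sketch; reusing the file name as the output suffix is only an illustrative convention):

for f in data/harmful_output/Qwen-Qwen3-1.7B/attack-dan_*.json; do
  python defense/run_defense_exp.py \
    --model Qwen/Qwen3-1.7B \
    --chat-file "${f}" \
    --output-suffix "$(basename "${f}" .json)"
done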

Command Line Arguments

--model               Target model served by vLLM (default: Qwen/Qwen3-1.7B)
--chat-file           Path to the chat file with harmful outputs (required)
--port                Port where the vLLM server is running (default: 8000)
--host                Hostname of the vLLM server (default: 127.0.0.1)
--output-dir          Output directory (default: data/defense_output/<model_dir>)
--output-suffix       Suffix for the output directory (default: "")
--strategies          Defense strategies to run (default: ex-2 ex-3 ex-cot)
--workers             Number of parallel workers (default: 128)
--frequency_penalty   Frequency penalty for generation (default: 0.0)
--presence_penalty    Presence penalty for generation (default: 0.0)
--temperature         Temperature for generation (default: 0.7)
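For example, to run only the 3-Agent configuration with deterministic decoding and fewer workers (a sketch using only the flags listed above; ex-3 is taken from the default strategy list, and passing a single strategy is assumed to be supported):

python defense/run_defense_exp.py \
  --model Qwen/Qwen3-1.7B \
  --chat-file data/harmful_output/Qwen-Qwen3-1.7B/attack-dan_0.json \
  --strategies ex-3 \
  --temperature 0.0 \
  --workers 32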

After the defense experiment finishes, the output appears in data/defense_output/<model_dir>/ (e.g. data/defense_output/Qwen-Qwen3-1.7B/).

GPT Evaluation (paper uses GPT-4)

Evaluating harmful output defense:

python evaluator/gpt4_evaluator.py \
  --defense_output_dir data/defense_output/Qwen-Qwen3-1.7B \
  --ori_prompt_file_name prompt_dan.json

After the evaluation finishes, the results will appear in data/defense_output/Qwen-Qwen3-1.7B/asr.csv, and a score value will also be added for each defense output in the output JSON file. evaluator/gpt4_evaluator.py uses a GPT model as the evaluator (the original paper uses GPT-4). Set your OpenAI credentials via environment variables (or CLI flags); you can swap the evaluator to a newer GPT model (e.g., GPT-5) via --model.

export OPENAI_API_KEY=...
# optional (only if you use an OpenAI-compatible endpoint):
# export OPENAI_BASE_URL=...

python evaluator/gpt4_evaluator.py \
  --defense_output_dir data/defense_output/Qwen-Qwen3-1.7B \
  --ori_prompt_file_name prompt_dan.json \
  --model gpt-4-1106-preview

GPT-based evaluation can be costly; we enable caching to avoid repeated evaluation.

For safe response evaluation, there is a more efficient method that does not require GPT-4. If you know that all the prompts in your dataset are regular user prompts and should not be rejected, you can use the following command to evaluate the false positive rate (FPR) of the defense output.

python evaluator/evaluate_safe.py

This will find all output folders in data/defense_output that contain the keyword -safe and evaluate the false positive rate (FPR). The FPR will be saved in the data/defense_output/defense_fp.csv file.
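You can inspect the resulting CSV with pandas (installed above); this simply prints the table and makes no assumption about its column layout:

python -c "import pandas as pd; print(pd.read_csv('data/defense_output/defense_fp.csv'))"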
