```bash
pip install vllm autogen pandas retry openai
```

## Prepare Inference Service Using vLLM
vLLM provides an OpenAI-compatible API server with efficient inference and built-in load balancing across multiple GPUs.
Start the vLLM server with your desired model. For multi-GPU setups, use --data-parallel-size to enable automatic load balancing:
Single GPU:
```bash
vllm serve Qwen/Qwen3-1.7B --port 8000
```

Multiple GPUs (e.g., 2 GPUs with data parallelism):

```bash
vllm serve Qwen/Qwen3-1.7B --port 8000 --data-parallel-size 2
```

With tensor parallelism for larger models:

```bash
vllm serve <your-large-model> --port 8000 --tensor-parallel-size 4
```

Combined tensor and data parallelism (8 GPUs, 2-way TP × 4-way DP):

```bash
vllm serve <your-large-model> --port 8000 --tensor-parallel-size 2 --data-parallel-size 4
```

For more details on data parallel deployment with internal load balancing, see the vLLM documentation.
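If you need to pin the server to a subset of GPUs (for example, on a shared machine), you can restrict device visibility with the standard `CUDA_VISIBLE_DEVICES` environment variable. This is generic CUDA/vLLM behavior, not a flag of this repository:

```bash
# Expose only GPUs 0 and 1 to the server (here with 2-way data parallelism)
CUDA_VISIBLE_DEVICES=0,1 vllm serve Qwen/Qwen3-1.7B --port 8000 --data-parallel-size 2
```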
You can verify the server is running by checking the models endpoint:
```bash
curl http://localhost:8000/v1/models
```
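You can also send a single test request through the OpenAI-compatible chat completions endpoint to confirm that generation works end to end (standard vLLM OpenAI-API usage, not specific to this repo):

```bash
# Quick sanity check: ask the served model for a one-sentence reply
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen3-1.7B",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}]
      }'
```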
The responses are generated by the target model served by vLLM (default: `Qwen/Qwen3-1.7B`). Make sure your vLLM server is running before executing the following command.

```bash
python attack/attack.py --model Qwen/Qwen3-1.7B --host 127.0.0.1 --port 8000
```

This command will generate responses using an attack prompt template (default: `--template v1`) loaded from `data/prompt/attack_prompt_template.json`.
To run multiple repetitions, invoke the script multiple times and vary --output-suffix and/or --cache-seed.
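For example, a minimal shell loop over three repetitions (a sketch; the suffix and seed values below are arbitrary, only the `--output-suffix` and `--cache-seed` flags come from above):

```bash
# Run three repetitions with distinct output suffixes and cache seeds
for i in 0 1 2; do
  python attack/attack.py \
    --model Qwen/Qwen3-1.7B \
    --host 127.0.0.1 --port 8000 \
    --output-suffix "rep${i}" \
    --cache-seed "${i}"
done
```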
To generate responses for safe/benign prompts (used for false positive evaluation):
```bash
python attack/attack.py \
  --model Qwen/Qwen3-1.7B \
  --template placeholder \
  --prompts data/prompt/safe_prompts.json \
  --output-prefix safe
```

The `placeholder` template passes prompts through without any attack framing, while `v1` wraps prompts with jailbreak instructions.
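To check which template names are available before passing one to `--template`, you can list the keys of the template file (this assumes the JSON is keyed by template name, e.g. `v1` and `placeholder`):

```bash
# List template names (requires jq) ...
jq 'keys' data/prompt/attack_prompt_template.json
# ... or do the same with plain Python
python -c "import json; print(list(json.load(open('data/prompt/attack_prompt_template.json'))))"
```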
The following command runs the 1-Agent, 2-Agent, and 3-Agent defense experiments. The `--chat-file` argument should point to the harmful outputs generated by `attack/attack.py` (by default saved under `data/harmful_output/<model_dir>/`, e.g. `data/harmful_output/Qwen-Qwen3-1.7B/attack-dan_0.json`).
```bash
export AUTOGEN_USE_DOCKER=0
python defense/run_defense_exp.py \
  --model Qwen/Qwen3-1.7B \
  --chat-file data/harmful_output/Qwen-Qwen3-1.7B/attack-dan_0.json
```

| Argument | Description | Default |
|---|---|---|
| `--model` | Target model served by vLLM | `Qwen/Qwen3-1.7B` |
| `--chat-file` | Path to the chat file with harmful outputs | Required |
| `--port` | Port where vLLM server is running | `8000` |
| `--host` | Hostname of the vLLM server | `127.0.0.1` |
| `--output-dir` | Output directory | `data/defense_output/<model_dir>` |
| `--output-suffix` | Suffix for output directory | `""` |
| `--strategies` | Defense strategies to run | `ex-2 ex-3 ex-cot` |
| `--workers` | Number of parallel workers | `128` |
| `--frequency_penalty` | Frequency penalty for generation | `0.0` |
| `--presence_penalty` | Presence penalty for generation | `0.0` |
| `--temperature` | Temperature for generation | `0.7` |
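As an illustration, the defaults in the table can be overridden on the command line. The values below (a subset of strategies, fewer workers, temperature 0, a custom suffix) are arbitrary examples, not recommended settings:

```bash
export AUTOGEN_USE_DOCKER=0
# Run only two of the defense strategies with 32 workers and greedy decoding
python defense/run_defense_exp.py \
  --model Qwen/Qwen3-1.7B \
  --chat-file data/harmful_output/Qwen-Qwen3-1.7B/attack-dan_0.json \
  --strategies ex-2 ex-cot \
  --workers 32 \
  --temperature 0.0 \
  --output-suffix custom
```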
After finishing the defense experiment, the output will appear in data/defense_output/<model_dir>/ (e.g. data/defense_output/Qwen-Qwen3-1.7B/).
Evaluating harmful output defense:
```bash
python evaluator/gpt4_evaluator.py \
  --defense_output_dir data/defense_output/Qwen-Qwen3-1.7B \
  --ori_prompt_file_name prompt_dan.json
```

After finishing the evaluation, the output will appear in `data/defense_output/Qwen-Qwen3-1.7B/asr.csv`.
A score value will also be added to each defense output in the output JSON file.
evaluator/gpt4_evaluator.py uses a GPT model as the evaluator (the original paper uses GPT-4). Set your OpenAI credentials via environment variables (or CLI flags), and you can swap the evaluator to a newer GPT model (e.g., GPT-5) via --model.
```bash
export OPENAI_API_KEY=...
# optional (only if you use an OpenAI-compatible endpoint):
# export OPENAI_BASE_URL=...

python evaluator/gpt4_evaluator.py \
  --defense_output_dir data/defense_output/Qwen-Qwen3-1.7B \
  --ori_prompt_file_name prompt_dan.json \
  --model gpt-4-1106-preview
```

GPT-based evaluation can be costly; we enable caching to avoid repeated evaluation.
For safe-response evaluation, there is a more efficient option that does not require GPT-4. If you know that all prompts in your dataset are regular user prompts and should not be rejected, you can use the following command to evaluate the false positive rate (FPR) of the defense output.
```bash
python evaluator/evaluate_safe.py
```

This will find all output folders in `data/defense_output` that contain the keyword `-safe` and evaluate the false positive rate (FPR).
The FPR will be saved in the data/defense_output/defense_fp.csv file.
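To take a quick look at the resulting table from the command line:

```bash
# Pretty-print the FPR CSV as aligned columns
column -s, -t < data/defense_output/defense_fp.csv
```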