PaperGuard/
├── main.py # Text review generation (normal/injected)
├── main_gcg.py # GCG attack runner
├── main_image.py # Image attack runner
├── main_defense.py # Defense benchmark runner
├── attack/
│ ├── prompt_injection.py # Prompt-injection attack implementation
│ ├── gcg.py # GCG attack implementation
│ └── image_attacks.py # PGD / Auto-PGD / C&W image attacks
├── defense/
│ ├── llm_judge.py # LLM-based defense
│ ├── emb_search.py # Embedding-search defense
│ ├── prompts.py # Prompt pool and A1/A2/A3 split logic
│ └── metrics.py # Defense metrics
├── models/ # Model wrappers (GPT-4o/Azure, Gemini, Claude, HF local)
├── prompts/prompt_template.py # Prompt generation and output parsing
├── data_processing/
│ ├── paperguard_data.py # Load/materialize the dataset from the HuggingFace Hub
│ ├── build_paperguard.py # Build the HF dataset from raw annotated data
│ ├── push_paperguard.py # Push the built dataset to the Hub
│ ├── parse.py # Provenance: PDF -> JSON via ScienceParse
│ └── convert_iclr_to_standard.py # Provenance: normalize parsed JSON
├── tools/eval.py # Attack success/statistics utility
├── requirements.txt
└── environment.yaml
Create and activate the conda environment:
conda env create -f environment.yaml
conda activate mm_review

The benchmark data is published on the HuggingFace Hub at
rellabear/PaperGuard. Download it
and reconstruct the on-disk layout the code expects with:
python -m data_processing.paperguard_data --materialize --out data/
# or a single source:
python -m data_processing.paperguard_data --materialize --out data/ --source iclr_2017

This produces, per source:
data/rejected_papers_<source>_with_figures/
├── parsed_pdfs/
│ ├── <paper_id>.pdf.json
│ └── ...
└── figures/
  ├── <paper_id>-1.png # method figure
  ├── <paper_id>-2.png # result figure
  └── ...

<source> is one of iclr_2017, AgentReview, F1000.
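
The layout above can be walked with a few lines of Python. This is a minimal sketch, not code from the repository: `paper_dir` and `list_papers` are hypothetical helper names, and the only things taken from the layout are the `rejected_papers_<source>_with_figures` folder name, the `parsed_pdfs/<paper_id>.pdf.json` files, and the `figures/<paper_id>-N.png` files.

```python
from pathlib import Path


def paper_dir(data_root, source):
    """Folder for one source, e.g. data/rejected_papers_iclr_2017_with_figures."""
    return Path(data_root) / f"rejected_papers_{source}_with_figures"


def list_papers(data_root, source):
    """Map each paper ID to its parsed JSON file and its figure files."""
    root = paper_dir(data_root, source)
    papers = {}
    for json_path in sorted((root / "parsed_pdfs").glob("*.pdf.json")):
        # "<paper_id>.pdf.json" -> "<paper_id>"
        paper_id = json_path.name[: -len(".pdf.json")]
        # Figures are named "<paper_id>-N.png"; sorted by file name.
        figures = sorted((root / "figures").glob(f"{paper_id}-*.png"))
        papers[paper_id] = {"json": json_path, "figures": figures}
    return papers
```

Calling `list_papers("data", "iclr_2017")` after materializing the dataset should return one entry per parsed paper; a paper with no extracted figures gets an empty list.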
The raw annotated papers were parsed with ScienceParse (see data_processing/parse.py and
data_processing/convert_iclr_to_standard.py). To rebuild the HuggingFace dataset from the
raw folders and (after review) push it:
python -m data_processing.build_paperguard --rawdata_dir ../rawdata # builds locally
python -m data_processing.push_paperguard # pushes to the Hub

Normal mode:
python main.py \
--model gpt-4o \
--attack_mode normal \
--paper_id 502 \
--data_dir data/rejected_papers_iclr_2017_with_figures \
--output_dir results

Injected mode:
python main.py \
--model gpt-4o \
--attack_mode injected \
--injection_location conclusion \
--paper_id 502,503 \
--seed 42 \
--output_dir results

GCG attack:
python main_gcg.py \
--model mistralai/Mistral-7B-Instruct-v0.2 \
--paper_id 502 \
--mode gcg \
--injection_location conclusion \
--gcg_steps 250 \
--output_dir results_gcg

Image attack:
python main_image.py \
--model llava-hf/llava-1.5-7b-hf \
--paper_id 502 \
--mode pgd \
--steps 50 \
--epsilon 0.0313725 \
--output_dir results_image_attack

Modes: pgd, auto-pgd, cw
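
For orientation, the `--epsilon 0.0313725` value used above is approximately 8/255, a common L-infinity budget for images scaled to [0, 1]. The sketch below shows what a generic L-infinity PGD loop does with such a budget on a toy objective. It is not the code in attack/image_attacks.py: the step size `ALPHA`, the helper names, and the linear toy loss are all assumptions made for illustration.

```python
EPSILON = 0.0313725  # from the command above; approximately 8/255
STEPS = 50           # from the command above
ALPHA = 2 / 255      # hypothetical step size, not taken from the repo


def sign(v):
    """Return -1, 0, or 1 depending on the sign of v."""
    return (v > 0) - (v < 0)


def pgd_linf(x0, grad_fn, epsilon=EPSILON, alpha=ALPHA, steps=STEPS):
    """Toy L-infinity PGD on a flat list of pixel values in [0, 1]."""
    x = list(x0)
    for _ in range(steps):
        grad = grad_fn(x)
        for i, g in enumerate(grad):
            xi = x[i] + alpha * sign(g)                          # signed ascent step
            xi = min(max(xi, x0[i] - epsilon), x0[i] + epsilon)  # project into the epsilon ball
            x[i] = min(max(xi, 0.0), 1.0)                        # keep a valid pixel value
    return x


# Toy objective: loss(x) = sum(w_i * x_i), so the gradient is the weight vector.
weights = [1.0, -1.0, 0.0, 1.0]
clean = [0.5, 0.5, 0.5, 0.99]
adv = pgd_linf(clean, lambda x: weights)
max_shift = max(abs(a - c) for a, c in zip(adv, clean))
```

However many steps are taken, the projection keeps every pixel within `EPSILON` of its clean value, so `max_shift` never exceeds the budget.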

LLM-based defense:
python main_defense.py \
--defense-method llm \
--judge-model gpt-4o \
--base-dir /data \
--output defense_results.json \
--a1-size 10 \
--seed 42

Embedding-based defense:
python main_defense.py \
--defense-method embed \
--model-name text-embedding-3-large \
--top-k 5 \
--data-dirs data/rejected_papers_iclr_2017_with_figures \
--output defense_results_embed.json

Use tools/eval.py to compute the attack success rate (ASR) and significance tests from paired normal/injected outputs:
python tools/eval.py \
--model gpt-4o \
--seed 42 \
--normal-dir /data/all_normal \
--injected-dir /data/all_injected \
--attack-name prompt-injection

The parser expects:
REVIEW with aspect tags such as: [SUMMARY], [MOTIVATION POSITIVE/NEGATIVE], [SUBSTANCE POSITIVE/NEGATIVE], [ORIGINALITY POSITIVE/NEGATIVE], [SOUNDNESS POSITIVE/NEGATIVE], [CLARITY POSITIVE/NEGATIVE], [REPLICABILITY POSITIVE/NEGATIVE], [MEANINGFUL COMPARISON POSITIVE/NEGATIVE]
REVIEW SCORE fields: OVERALL, SUBSTANCE, APPROPRIATENESS, MEANINGFUL_COMPARISON, SOUNDNESS_CORRECTNESS, ORIGINALITY, CLARITY, IMPACT
Each score is expected in [1, 10].
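
To illustrate the score format, here is a minimal sketch of extracting and range-checking the score fields. It is not the parser in prompts/prompt_template.py: the `FIELD: value` line layout, the sample review text, and the `parse_scores` name are assumptions; only the field names and the [1, 10] range come from the description above.

```python
import re

SCORE_FIELDS = (
    "OVERALL", "SUBSTANCE", "APPROPRIATENESS", "MEANINGFUL_COMPARISON",
    "SOUNDNESS_CORRECTNESS", "ORIGINALITY", "CLARITY", "IMPACT",
)


def parse_scores(text):
    """Extract `FIELD: number` lines and keep only values inside [1, 10]."""
    scores = {}
    for field in SCORE_FIELDS:
        # Assumed layout: the field name starts its own line, e.g. "OVERALL: 4".
        pattern = rf"^\s*{field}\s*:\s*(\d+(?:\.\d+)?)\s*$"
        match = re.search(pattern, text, re.MULTILINE)
        if match is None:
            continue  # field missing from this review
        value = float(match.group(1))
        if 1 <= value <= 10:
            scores[field] = value  # out-of-range values are dropped
    return scores


# Hypothetical model output, for illustration only.
sample = """REVIEW
[SUMMARY] The paper proposes a new training scheme.
[CLARITY NEGATIVE] Section 3 is hard to follow.
REVIEW SCORE
OVERALL: 4
CLARITY: 3
IMPACT: 11
"""
scores = parse_scores(sample)
```

In this sample, `IMPACT: 11` falls outside [1, 10] and is discarded, and the `[CLARITY NEGATIVE]` aspect tag is not mistaken for the `CLARITY` score because it does not follow the assumed `FIELD: value` layout.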