UNITES-Lab/PaperGuard

PaperGuard

Repository Layout

PaperGuard/
├── main.py                     # Text review generation (normal/injected)
├── main_gcg.py                 # GCG attack runner
├── main_image.py               # Image attack runner
├── main_defense.py             # Defense benchmark runner
├── attack/
│   ├── prompt_injection.py     # Prompt-injection attack implementation
│   ├── gcg.py                  # GCG attack implementation
│   └── image_attacks.py        # PGD / Auto-PGD / C&W image attacks
├── defense/
│   ├── llm_judge.py            # LLM-based defense
│   ├── emb_search.py           # Embedding-search defense
│   ├── prompts.py              # Prompt pool and A1/A2/A3 split logic
│   └── metrics.py              # Defense metrics
├── models/                     # Model wrappers (GPT-4o/Azure, Gemini, Claude, HF local)
├── prompts/prompt_template.py  # Prompt generation and output parsing
├── data_processing/
│   ├── paperguard_data.py      # Load/materialize the dataset from the HuggingFace Hub
│   ├── build_paperguard.py     # Build the HF dataset from raw annotated data
│   ├── push_paperguard.py      # Push the built dataset to the Hub
│   ├── parse.py                # Provenance: PDF -> JSON via ScienceParse
│   └── convert_iclr_to_standard.py  # Provenance: normalize parsed JSON
├── tools/eval.py               # Attack success/statistics utility
├── requirements.txt
└── environment.yaml

Installation

conda env create -f environment.yaml
conda activate mm_review

Data

The benchmark data is published on the Hugging Face Hub at rellabear/PaperGuard. To download it and reconstruct the on-disk layout the code expects, run:

python -m data_processing.paperguard_data --materialize --out data/
# or a single source:
python -m data_processing.paperguard_data --materialize --out data/ --source iclr_2017

This produces, per source:

data/rejected_papers_<source>_with_figures/
├── parsed_pdfs/
│   ├── <paper_id>.pdf.json
│   └── ...
└── figures/
    ├── <paper_id>-1.png   # method figure
    ├── <paper_id>-2.png   # result figure
    └── ...

<source> is one of iclr_2017, AgentReview, F1000.
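
To sanity-check a materialized source, the layout above can be walked with a few lines of Python. This is a minimal sketch, not part of the repository: `index_source` is a hypothetical helper, and it assumes figure files are always named `<paper_id>-<n>.png` as shown in the tree.

```python
from collections import defaultdict
from pathlib import Path


def index_source(data_root, source):
    """Map each paper id to its parsed JSON file name and figure file names.

    Hypothetical helper; assumes the layout
    <data_root>/rejected_papers_<source>_with_figures/{parsed_pdfs,figures}/.
    """
    base = Path(data_root) / f"rejected_papers_{source}_with_figures"

    # Group figures by the part of the file stem before the last "-".
    figures = defaultdict(list)
    for fig in sorted((base / "figures").glob("*.png")):
        paper_id, _, _ = fig.stem.rpartition("-")
        figures[paper_id].append(fig.name)

    # Attach the figures to every parsed paper.
    index = {}
    for parsed in sorted((base / "parsed_pdfs").glob("*.pdf.json")):
        paper_id = parsed.name[: -len(".pdf.json")]
        index[paper_id] = {
            "parsed": parsed.name,
            "figures": figures.get(paper_id, []),
        }
    return index
```

For example, `index_source("data", "iclr_2017")` returns one entry per paper, and an entry with an empty `figures` list points to a paper whose figures were not materialized.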

(Re)building the dataset from raw data

The raw annotated papers were parsed with ScienceParse (see data_processing/parse.py and data_processing/convert_iclr_to_standard.py). To rebuild the Hugging Face dataset from the raw folders and, after reviewing the result, push it to the Hub:

python -m data_processing.build_paperguard --rawdata_dir ../rawdata  # builds locally
python -m data_processing.push_paperguard                            # pushes to the Hub

Usage

1) Generate Normal or Prompt-Injected Reviews

python main.py \
  --model gpt-4o \
  --attack_mode normal \
  --paper_id 502 \
  --data_dir data/rejected_papers_iclr_2017_with_figures \
  --output_dir results

Injected mode:

python main.py \
  --model gpt-4o \
  --attack_mode injected \
  --injection_location conclusion \
  --paper_id 502,503 \
  --seed 42 \
  --output_dir results

2) Run GCG Attack

python main_gcg.py \
  --model mistralai/Mistral-7B-Instruct-v0.2 \
  --paper_id 502 \
  --mode gcg \
  --injection_location conclusion \
  --gcg_steps 250 \
  --output_dir results_gcg

3) Run Image Attacks (VLM)

python main_image.py \
  --model llava-hf/llava-1.5-7b-hf \
  --paper_id 502 \
  --mode pgd \
  --steps 50 \
  --epsilon 0.0313725 \
  --output_dir results_image_attack

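Supported values for --mode (implemented in attack/image_attacks.py):

  • pgd: projected gradient descent (PGD)
  • auto-pgd: Auto-PGD
  • cw: Carlini & Wagner (C&W)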
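
The --epsilon value in the command above, 0.0313725, equals 8/255 when pixels are scaled to [0, 1]. The sketch below shows what a single L-infinity PGD update looks like under that budget. It is an illustration only, not the code in attack/image_attacks.py: `pgd_step` and its arguments are hypothetical names, it assumes an L-infinity constraint and [0, 1] pixel scaling, and `grad` stands for the gradient of whatever attack objective is being maximized.

```python
EPSILON = 8 / 255  # ~0.0313725, assuming pixels are scaled to [0, 1]


def sign(value):
    """Return -1, 0, or 1 according to the sign of value."""
    return (value > 0) - (value < 0)


def pgd_step(x_adv, x_orig, grad, step_size, epsilon=EPSILON):
    """One L-infinity PGD update on a flat list of pixel values in [0, 1]."""
    updated = []
    for xa, xo, g in zip(x_adv, x_orig, grad):
        moved = xa + step_size * sign(g)  # step along the gradient sign
        moved = min(max(moved, xo - epsilon), xo + epsilon)  # project onto the epsilon-ball
        updated.append(min(max(moved, 0.0), 1.0))  # keep a valid pixel value
    return updated
```

Repeating this update for the number of iterations given by --steps yields a perturbation that never moves any pixel more than epsilon away from the original image.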

4) Evaluate Defenses

python main_defense.py \
  --defense-method llm \
  --judge-model gpt-4o \
  --base-dir /data \
  --output defense_results.json \
  --a1-size 10 \
  --seed 42

Embedding-based defense:

python main_defense.py \
  --defense-method embed \
  --model-name text-embedding-3-large \
  --top-k 5 \
  --data-dirs data/rejected_papers_iclr_2017_with_figures \
  --output defense_results_embed.json

Post-hoc Attack Evaluation

Use tools/eval.py to compute the attack success rate (ASR) and run significance tests on paired normal/injected outputs:

python tools/eval.py \
  --model gpt-4o \
  --seed 42 \
  --normal-dir /data/all_normal \
  --injected-dir /data/all_injected \
  --attack-name prompt-injection
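
The exact success criterion and statistical test are defined in tools/eval.py. As a rough illustration of the idea, the sketch below assumes an attack "succeeds" on a paper when the injected OVERALL score is strictly higher than the normal one, and it uses an exact two-sided sign test as a stand-in for the significance test. Both choices, and the function names, are assumptions made for this example.

```python
from math import comb


def attack_success_rate(normal, injected):
    """Fraction of papers whose injected score exceeds the normal score."""
    pairs = list(zip(normal, injected))
    wins = sum(1 for n, i in pairs if i > n)
    return wins / len(pairs)


def sign_test_p_value(normal, injected):
    """Exact two-sided sign test on paired scores; ties are ignored."""
    up = sum(1 for n, i in zip(normal, injected) if i > n)
    down = sum(1 for n, i in zip(normal, injected) if i < n)
    total = up + down
    if total == 0:
        return 1.0  # no informative pairs
    k = min(up, down)
    # Probability of k or fewer of the rarer outcome under a fair coin.
    tail = sum(comb(total, j) for j in range(k + 1)) / 2 ** total
    return min(1.0, 2 * tail)


normal_scores = [4, 5, 3, 6]    # illustrative values, not real results
injected_scores = [6, 5, 5, 8]  # illustrative values, not real results
asr = attack_success_rate(normal_scores, injected_scores)
p_value = sign_test_p_value(normal_scores, injected_scores)
```

Pairing scores by paper matters here: the comparison is between the same paper reviewed with and without the attack, not between two unrelated sets of reviews.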

Review Output Schema

The output parser (prompts/prompt_template.py) expects:

  1. REVIEW with aspect tags such as:
    • [SUMMARY]
    • [MOTIVATION POSITIVE/NEGATIVE]
    • [SUBSTANCE POSITIVE/NEGATIVE]
    • [ORIGINALITY POSITIVE/NEGATIVE]
    • [SOUNDNESS POSITIVE/NEGATIVE]
    • [CLARITY POSITIVE/NEGATIVE]
    • [REPLICABILITY POSITIVE/NEGATIVE]
    • [MEANINGFUL COMPARISON POSITIVE/NEGATIVE]
  2. REVIEW SCORE fields:
    • OVERALL, SUBSTANCE, APPROPRIATENESS, MEANINGFUL_COMPARISON, SOUNDNESS_CORRECTNESS, ORIGINALITY, CLARITY, IMPACT

Each score is expected to lie in the range [1, 10].
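
A quick way to check a generated review against the score part of this schema is sketched below. It is not the parser in prompts/prompt_template.py: `parse_scores` is a hypothetical helper, and it assumes each score appears on its own line as `NAME: <integer>`, which may differ from the format the repository actually emits.

```python
import re

SCORE_FIELDS = (
    "OVERALL", "SUBSTANCE", "APPROPRIATENESS", "MEANINGFUL_COMPARISON",
    "SOUNDNESS_CORRECTNESS", "ORIGINALITY", "CLARITY", "IMPACT",
)


def parse_scores(text):
    """Extract all score fields and check that each lies in [1, 10].

    Assumes one "NAME: <integer>" line per field (an assumption of this sketch).
    """
    scores = {}
    for field in SCORE_FIELDS:
        match = re.search(
            rf"^\s*{field}\s*:\s*(\d+)\s*$", text, flags=re.MULTILINE
        )
        if match is None:
            raise ValueError(f"missing score field: {field}")
        value = int(match.group(1))
        if not 1 <= value <= 10:
            raise ValueError(f"{field} out of range [1, 10]: {value}")
        scores[field] = value
    return scores
```

Because the pattern is anchored at the start of a line, an aspect tag such as [SUBSTANCE POSITIVE] in the review body is not mistaken for the SUBSTANCE score.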

About

[ICML 2025] Code for the paper "Does AI Reviewer See the Full Picture? Attacking and Defending Multimodal Peer Review"
