PaperGuard

Repository Layout

PaperGuard/
├── main.py                     # Text review generation (normal/injected)
├── main_gcg.py                 # GCG attack runner
├── main_image.py               # Image attack runner
├── main_defense.py             # Defense benchmark runner
├── attack/
│   ├── prompt_injection.py     # Prompt-injection attack implementation
│   ├── gcg.py                  # GCG attack implementation
│   └── image_attacks.py        # PGD / Auto-PGD / C&W image attacks
├── defense/
│   ├── llm_judge.py            # LLM-based defense
│   ├── emb_search.py           # Embedding-search defense
│   ├── prompts.py              # Prompt pool and A1/A2/A3 split logic
│   └── metrics.py              # Defense metrics
├── models/                     # Model wrappers (GPT-4o/Azure, Gemini, Claude, HF local)
├── prompts/prompt_template.py  # Prompt generation and output parsing
├── data_processing/
│   ├── paperguard_data.py      # Load/materialize the dataset from the HuggingFace Hub
│   ├── build_paperguard.py     # Build the HF dataset from raw annotated data
│   ├── push_paperguard.py      # Push the built dataset to the Hub
│   ├── parse.py                # Provenance: PDF -> JSON via ScienceParse
│   └── convert_iclr_to_standard.py  # Provenance: normalize parsed JSON
├── tools/eval.py               # Attack success/statistics utility
├── requirements.txt
└── environment.yaml

Installation

conda env create -f environment.yaml
conda activate mm_review

Data

The benchmark data is published on the HuggingFace Hub at rellabear/PaperGuard. Download it and reconstruct the on-disk layout the code expects with:

python -m data_processing.paperguard_data --materialize --out data/
# or a single source:
python -m data_processing.paperguard_data --materialize --out data/ --source iclr_2017

This produces, per source:

data/rejected_papers_<source>_with_figures/
├── parsed_pdfs/
│   ├── <paper_id>.pdf.json
│   └── ...
└── figures/
    ├── <paper_id>-1.png   # method figure
    ├── <paper_id>-2.png   # result figure
    └── ...

<source> is one of iclr_2017, AgentReview, F1000.
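
As a rough illustration of how the materialized layout above can be consumed, the sketch below pairs each parsed JSON file with its figures. The helper name `index_papers` is hypothetical and is not part of the repository; it only assumes the `parsed_pdfs/<paper_id>.pdf.json` and `figures/<paper_id>-<n>.png` naming shown above.

```python
import re
from pathlib import Path


def index_papers(source_dir):
    """Map each paper ID to its parsed JSON path and its figure paths.

    Hypothetical helper: assumes only the directory layout shown above.
    """
    source_dir = Path(source_dir)
    index = {}

    # One entry per parsed paper: <paper_id>.pdf.json
    for json_path in sorted((source_dir / "parsed_pdfs").glob("*.pdf.json")):
        paper_id = json_path.name[: -len(".pdf.json")]
        index[paper_id] = {"parsed": json_path, "figures": []}

    # Attach figures named <paper_id>-<n>.png to their paper.
    fig_pattern = re.compile(r"^(?P<paper_id>.+)-(?P<n>\d+)\.png$")
    for fig_path in (source_dir / "figures").glob("*.png"):
        match = fig_pattern.match(fig_path.name)
        if match and match["paper_id"] in index:
            index[match["paper_id"]]["figures"].append((int(match["n"]), fig_path))

    # Order figures numerically so that -2 comes before -10.
    for entry in index.values():
        entry["figures"] = [path for _, path in sorted(entry["figures"])]
    return index
```

Calling `index_papers("data/rejected_papers_iclr_2017_with_figures")` would then return a dictionary keyed by paper ID, which is a convenient way to check that every paper has both a method figure and a result figure before starting a run.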

(Re)building the dataset from raw data

The raw annotated papers were parsed with ScienceParse (see data_processing/parse.py and data_processing/convert_iclr_to_standard.py). To rebuild the HuggingFace dataset from the raw folders and, once you have reviewed the result, push it to the Hub:

python -m data_processing.build_paperguard --rawdata_dir ../rawdata  # builds locally
python -m data_processing.push_paperguard                            # pushes to the Hub

Usage

1) Generate Normal or Prompt-Injected Reviews

Normal mode:

python main.py \
  --model gpt-4o \
  --attack_mode normal \
  --paper_id 502 \
  --data_dir data/rejected_papers_iclr_2017_with_figures \
  --output_dir results

Injected mode:

python main.py \
  --model gpt-4o \
  --attack_mode injected \
  --injection_location conclusion \
  --paper_id 502,503 \
  --seed 42 \
  --output_dir results

2) Run GCG Attack

python main_gcg.py \
  --model mistralai/Mistral-7B-Instruct-v0.2 \
  --paper_id 502 \
  --mode gcg \
  --injection_location conclusion \
  --gcg_steps 250 \
  --output_dir results_gcg

3) Run Image Attacks (VLM)

python main_image.py \
  --model llava-hf/llava-1.5-7b-hf \
  --paper_id 502 \
  --mode pgd \
  --steps 50 \
  --epsilon 0.0313725 \
  --output_dir results_image_attack

Modes (values for --mode):

- pgd
- auto-pgd
- cw
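
The `--epsilon 0.0313725` value in the example above equals 8/255 to seven decimal places. Assuming pixels are scaled to [0, 1] and the budget is an L-infinity bound (both are assumptions, not something this README states), the projection step shared by PGD-style attacks can be sketched as follows. The function name `project_linf` is hypothetical.

```python
EPSILON = 8 / 255  # assumption: --epsilon 0.0313725 is this value, rounded


def project_linf(original, perturbed, epsilon=EPSILON):
    """Clip each perturbed pixel to [x - eps, x + eps], then to [0, 1].

    `original` and `perturbed` are flat sequences of pixel values in [0, 1].
    """
    projected = []
    for x, x_adv in zip(original, perturbed):
        # Stay within the epsilon-ball around the clean pixel.
        x_adv = min(max(x_adv, x - epsilon), x + epsilon)
        # Stay within the valid pixel range.
        projected.append(min(max(x_adv, 0.0), 1.0))
    return projected
```

This is only the constraint step; a full attack would alternate it with gradient steps on the model's loss, which is what attack/image_attacks.py is described as implementing.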

4) Evaluate Defenses

python main_defense.py \
  --defense-method llm \
  --judge-model gpt-4o \
  --base-dir /data \
  --output defense_results.json \
  --a1-size 10 \
  --seed 42

Embedding-based defense:

python main_defense.py \
  --defense-method embed \
  --model-name text-embedding-3-large \
  --top-k 5 \
  --data-dirs data/rejected_papers_iclr_2017_with_figures \
  --output defense_results_embed.json
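
This README does not describe how defense/emb_search.py uses `--top-k`. One common pattern for an embedding-search defense is to embed a candidate passage, compare it with a pool of known injection prompts by cosine similarity, and keep the k closest matches. The sketch below illustrates that pattern under those assumptions; `cosine` and `top_k_matches` are hypothetical names, and the vectors stand in for embeddings produced by a model such as text-embedding-3-large.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_matches(query, pool, k=5):
    """Return the k (similarity, name) pairs from `pool` closest to `query`.

    `pool` maps a prompt name to its embedding vector.
    """
    scored = [(cosine(query, vec), name) for name, vec in pool.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

A passage could then be flagged when its best similarity exceeds a chosen threshold; the threshold itself is a design choice this sketch leaves open.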

Post-hoc Attack Evaluation

Use tools/eval.py to compute ASR and significance tests from paired normal/injected outputs:

python tools/eval.py \
  --model gpt-4o \
  --seed 42 \
  --normal-dir /data/all_normal \
  --injected-dir /data/all_injected \
  --attack-name prompt-injection
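
To make the paired evaluation concrete, here is a minimal sketch of an attack success rate (ASR) and a significance test over paired scores. It is not the logic of tools/eval.py: the success criterion (the injected score is strictly higher than the normal score for the same paper) and the choice of an exact two-sided sign test are both assumptions made for illustration, and the function names are hypothetical.

```python
from math import comb


def attack_success_rate(normal, injected):
    """Fraction of paired papers whose injected score beats the normal score.

    `normal` and `injected` map paper IDs to a numeric score.
    """
    paired = [(normal[p], injected[p]) for p in normal if p in injected]
    if not paired:
        return 0.0
    return sum(inj > norm for norm, inj in paired) / len(paired)


def sign_test_p_value(normal, injected):
    """Exact two-sided sign test on paired score differences (ties dropped)."""
    diffs = [injected[p] - normal[p] for p in normal if p in injected]
    wins = sum(d > 0 for d in diffs)
    losses = sum(d < 0 for d in diffs)
    n = wins + losses
    if n == 0:
        return 1.0
    # Probability of a split at least this lopsided under a fair coin.
    tail = sum(comb(n, i) for i in range(min(wins, losses) + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

With only a handful of papers the sign test has little power, so a small p-value should not be expected from a very small paired sample.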

Review Output Schema

The parser expects:

REVIEW with aspect tags such as:
- [SUMMARY]
- [MOTIVATION POSITIVE/NEGATIVE]
- [SUBSTANCE POSITIVE/NEGATIVE]
- [ORIGINALITY POSITIVE/NEGATIVE]
- [SOUNDNESS POSITIVE/NEGATIVE]
- [CLARITY POSITIVE/NEGATIVE]
- [REPLICABILITY POSITIVE/NEGATIVE]
- [MEANINGFUL COMPARISON POSITIVE/NEGATIVE]
REVIEW SCORE fields:
- OVERALL, SUBSTANCE, APPROPRIATENESS, MEANINGFUL_COMPARISON, SOUNDNESS_CORRECTNESS, ORIGINALITY, CLARITY, IMPACT

Each score is expected in [1, 10].
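
The range check can be sketched as follows. The exact textual format the parser in prompts/prompt_template.py accepts is not shown here, so this example assumes one `FIELD: value` pair per line; `parse_scores` is a hypothetical name, not the repository's function.

```python
import re

SCORE_FIELDS = (
    "OVERALL", "SUBSTANCE", "APPROPRIATENESS", "MEANINGFUL_COMPARISON",
    "SOUNDNESS_CORRECTNESS", "ORIGINALITY", "CLARITY", "IMPACT",
)


def parse_scores(text):
    """Extract score fields from review text and check each is in [1, 10].

    Assumes lines of the form `FIELD: value`; fields that do not appear in
    that form are skipped rather than reported as errors.
    """
    scores = {}
    for field in SCORE_FIELDS:
        match = re.search(
            rf"^\s*{field}\s*:\s*(\d+(?:\.\d+)?)\s*$", text, re.MULTILINE
        )
        if match is None:
            continue
        value = float(match.group(1))
        if not 1 <= value <= 10:
            raise ValueError(f"{field} score {value} is outside [1, 10]")
        scores[field] = value
    return scores
```

Rejecting out-of-range values early keeps malformed model output from silently skewing downstream statistics.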
