This repository contains the code and data for PRISM-Bench, a benchmark designed to evaluate multimodal large language models (MLLMs) on reasoning with complex visual puzzles.
🎉 **Update (March 2026): PRISM-Bench Data Fully Re-released!**

> Following a rigorous manual review and multiple rounds of auditing, we have resolved the previous data-pipeline and CoT-hallucination issues. The entire benchmark has been updated and is now fully available. We deeply appreciate the community's feedback, which has been instrumental in making PRISM-Bench more robust.
For full transparency regarding our recent data quality improvements and bug fixes, please see our Detailed Release Notes & Incident Archive.
- Diverse Visual Reasoning Tasks: puzzles, graph-based reasoning, pattern recognition, algorithmic reasoning, etc.
- Chain-of-Thought (CoT) Annotations: ground-truth reasoning steps for each problem.
- Instruction Corruption: synthetic corrupted reasoning chains for robustness testing.
- First-Error Detection: annotations for the first mistake in incorrect reasoning chains.
- VQA-Style Evaluation: multiple-choice format with definitive ground-truth answers.
Each entry in the benchmark is stored as a JSON object:
```json
{
  "id": 2,
  "image_url": "https://example.com/images/question_0002.png",
  "question_text": "Figure Reasoning",
  "answer": "D",
  "groundtruth_cot": "Step 1) ... Step 2) ... Step 3) ...",
  "cot_corrupted": "Step 1) ... (error inserted)",
  "first_error": "Step 2"
}
```
Notes:
- `image_url` is an external link to the image (not hosted in this repo); users should download or copy the images locally.
- Some URLs may expire; please handle missing downloads gracefully.
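As a quick sanity check before evaluation, an entry can be parsed and its fields validated with a few lines of Python. This is a minimal sketch: the field names follow the schema above, and the example values are illustrative.

```python
import json

# Example entry following the schema above (values are illustrative).
entry_json = """
{
  "id": 2,
  "image_url": "https://example.com/images/question_0002.png",
  "question_text": "Figure Reasoning",
  "answer": "D",
  "groundtruth_cot": "Step 1) ... Step 2) ... Step 3) ...",
  "cot_corrupted": "Step 1) ... (error inserted)",
  "first_error": "Step 2"
}
"""

entry = json.loads(entry_json)

# Verify that all expected fields are present before using the entry.
REQUIRED_FIELDS = {"id", "image_url", "question_text", "answer",
                   "groundtruth_cot", "cot_corrupted", "first_error"}
missing = REQUIRED_FIELDS - entry.keys()
assert not missing, f"entry {entry['id']} is missing fields: {missing}"

print(entry["answer"])       # D
print(entry["first_error"])  # Step 2
```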
- Clone the repository.
- Download images. We provide a helper script to cache images locally:

  ```shell
  python data/download_images.py \
      --input data/download_images_url.jsonl \
      --output-dir data/images/
  ```
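The helper script above is the supported path. If you need to fetch individual images yourself while handling expired URLs gracefully, a minimal standard-library sketch could look like this (the function name and skip-on-failure behavior are our own assumptions, not part of the repo's tooling):

```python
import urllib.error
import urllib.request
from pathlib import Path


def fetch_image(url: str, dest: Path) -> bool:
    """Download `url` to `dest`; return False instead of raising if the URL is dead."""
    try:
        with urllib.request.urlopen(url, timeout=30) as resp:
            dest.write_bytes(resp.read())
        return True
    except (urllib.error.URLError, TimeoutError, OSError):
        # Expired or unreachable URL: skip gracefully, as noted above.
        return False
```

Entries whose images fail to download can then be skipped or logged rather than aborting the whole run.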
- Run inference. The system prompt and the expected output format are provided in `inference/example_inference.py`.
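Because evaluation is multiple-choice with definitive ground-truth answers, scoring reduces to exact match over answer letters. A minimal sketch (the function name and dict-keyed-by-id layout are our own assumptions; consult the repo's inference example for the authoritative output format):

```python
def score_answers(predictions: dict[int, str], references: dict[int, str]) -> float:
    """Exact-match accuracy over answer letters, keyed by question id.

    Missing predictions count as wrong; comparison is case-insensitive.
    """
    if not references:
        return 0.0
    correct = sum(
        1
        for qid, ref in references.items()
        if predictions.get(qid, "").strip().upper() == ref.strip().upper()
    )
    return correct / len(references)


# Example: one of two answers matches, so accuracy is 0.5.
print(score_answers({1: "A", 2: "d"}, {1: "B", 2: "D"}))  # 0.5
```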
This project is distributed under the terms described in the LICENSE file.