MAC - A Live Benchmark for Multimodal Large Language Models in Scientific Understanding
- Two Task Types: Image-to-Text and Text-to-Image understanding
- Advanced Methods: DAD (Description and Deduction) methodology with multiple variants
- Multiple Models: Support for GPT-4o, Qwen2.5-VL, Step-1V, Gemini, and more
- Scientific Focus: Real scientific journal covers from Nature, Science, Cell, etc.
git clone https://github.com/mhjiang0408/MAC_Bench.git
cd MAC_Bench
chmod +x setup.sh
./setup.sh
The setup script automatically:
- Creates the environment from environment.yml
- Installs CLI dependencies
- Downloads the dataset from Hugging Face
- Sets up and verifies CLI tools
1. Create Configuration
mac config template --output config.yaml --type example
# Edit config.yaml with your API keys
2. Run Experiment
mac run --config config.yaml
3. Analyze Results
mac analyze experiment/results/
Create your configuration file:
mac config template --output config.yaml --type basic
Example configuration:
models:
  - name: gpt-4o
    api_base: https://api.openai.com/v1
    api_key: sk-your-api-key
    prompt_template: Config/prompt_template/4_choice_template.json
    # If you are testing text2image tasks, you need to set prompt_template to Config/prompt_template/4_choice_template_given_cover_story.json
    resume: false
    resume_path: None
    num_workers: 4
data:
  data_path: MAC_Bench/image2text_info.csv
  output_folder: ./experiment/results/
  scaling_factor: 1.0
  num_options: 4
  type: image2text
  # If you are testing text2image tasks, you need to set type to text2image
  random_seed: 42

| Command | Purpose | Example |
|---|---|---|
| mac run | Run experiments | mac run --config config.yaml |
| mac analyze | Analyze results | mac analyze experiment/results/ |
| mac status | Check system status | mac status --detailed |
| mac config | Manage configurations | mac config validate config.yaml |
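The two comments in the example configuration say that a text2image run needs both a different `type` and a different `prompt_template`. The sketch below shows one way to apply both changes together on a configuration that has already been loaded into a Python dict. The helper name `switch_task` is hypothetical, and the layout (templates set per model, `type` under `data`) is an assumption about how the keys nest, not something the CLI is confirmed to require.

```python
import copy

# Template paths quoted from the comments in the example configuration.
TEMPLATES = {
    "image2text": "Config/prompt_template/4_choice_template.json",
    "text2image": "Config/prompt_template/4_choice_template_given_cover_story.json",
}


def switch_task(config: dict, task: str) -> dict:
    """Return a copy of `config` with task type and prompt template set for `task`.

    Hypothetical helper: assumes `prompt_template` is set per model and
    `type` lives under `data`.
    """
    if task not in TEMPLATES:
        raise ValueError(f"unknown task type: {task!r}")
    updated = copy.deepcopy(config)  # leave the caller's dict untouched
    updated["data"]["type"] = task
    for model in updated["models"]:
        model["prompt_template"] = TEMPLATES[task]
    return updated


config = {
    "models": [{"name": "gpt-4o", "prompt_template": TEMPLATES["image2text"]}],
    "data": {"type": "image2text", "num_options": 4},
}
text2image_config = switch_task(config, "text2image")
print(text2image_config["data"]["type"])  # text2image
```

The sketch leaves `data_path` untouched, so that key still has to be checked by hand before a text2image run.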
For mac run:
- --config config.yaml - Configuration file
- --models gpt-4o - Run specific model
- --scaling-factor 0.01 - Use 1% of data for testing
- --dry-run - Preview without running
- --verbose - Detailed output
For mac analyze:
- --output reports/ - Output directory
- --format html - Report format (json/csv/html/all)
- --compare exp2.csv - Compare experiments
- --detailed - Include detailed analysis
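To make the effect of `--scaling-factor` concrete, here is a minimal sketch of seeded subsampling. It is not the CLI's actual implementation: the function name `subsample` is hypothetical, and the pairing of the scaling factor with the `random_seed` value from the configuration is an assumption about how a reproducible 1% subset could be drawn.

```python
import random


def subsample(rows: list, scaling_factor: float, seed: int = 42) -> list:
    """Pick a reproducible random subset holding about `scaling_factor` of `rows`."""
    if not 0.0 < scaling_factor <= 1.0:
        raise ValueError("scaling_factor must be in (0, 1]")
    if not rows:
        return []
    # Keep at least one row so very small factors still give a usable test run.
    k = max(1, round(len(rows) * scaling_factor))
    # A dedicated Random instance avoids touching the global random state.
    return random.Random(seed).sample(rows, k)


rows = list(range(10_000))
subset = subsample(rows, 0.01)
print(len(subset))  # 100
```

Because the seed is fixed, repeated calls with the same inputs return the same subset, which keeps quick test runs comparable across models.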
# Test with 1% of data
mac run --config config.yaml --scaling-factor 0.01 --verbose
# Run all models from config
mac run --config config.yaml
# Analyze with comprehensive reports
mac analyze experiment/results/ --output reports/ --format all
# Run specific models
mac run --config config.yaml --models gpt-4o --models qwen-vl-max
# Compare results
mac analyze results1.csv --compare results2.csv --plot
Image-to-Text:
- Input: Scientific journal cover image
- Question: "Which of the following options best describe the cover image?"
- Options: 4 text descriptions (A, B, C, D)
- Goal: Select the most accurate description
Text-to-Image:
- Input: Journal cover story text
- Question: "Which image best describes the cover story?"
- Options: 4 candidate images (A, B, C, D)
- Goal: Select the matching image
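Both tasks reduce to a four-option multiple-choice question, so the prompt assembly and scoring can be sketched in a few lines. This is an illustration under stated assumptions, not the benchmark's own code: `build_item`, `parse_choice`, and `accuracy` are hypothetical helpers, the options here are plain strings (for text2image they would stand in for images), and the parser only accepts replies that begin with the option letter.

```python
import random
import re
from typing import List, Optional, Sequence, Tuple

LETTERS = "ABCD"


def build_item(question: str, correct: str, distractors: Sequence[str],
               seed: int = 0) -> Tuple[str, str]:
    """Shuffle one correct option with three distractors; return (prompt, answer letter)."""
    options = [correct] + list(distractors)
    if len(options) != len(LETTERS):
        raise ValueError("expected exactly three distractors")
    random.Random(seed).shuffle(options)  # seeded so the item is reproducible
    answer = LETTERS[options.index(correct)]
    lines = [question] + [f"{letter}. {text}" for letter, text in zip(LETTERS, options)]
    return "\n".join(lines), answer


def parse_choice(reply: str) -> Optional[str]:
    """Return the option letter if the reply starts with one, e.g. 'B', '(b)', 'B. ...'."""
    match = re.match(r"\(?([ABCD])\)?(?:[\s.:)]|$)", reply.strip().upper())
    return match.group(1) if match else None


def accuracy(replies: List[str], answers: List[str]) -> float:
    """Fraction of replies whose parsed letter equals the reference answer."""
    if not answers:
        return 0.0
    hits = sum(parse_choice(r) == a for r, a in zip(replies, answers))
    return hits / len(answers)


prompt, answer = build_item(
    "Which of the following options best describe the cover image?",
    "A protein folding into its native state",
    ["A coral reef at night", "A galaxy cluster", "A neuron firing"],
)
print(prompt)
print("Reference answer:", answer)
```

Replies that do not start with a letter are counted as wrong rather than guessed at, which keeps the score conservative when a model answers in free text.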
MAC_Bench/
├── mac                     # CLI entry point
├── setup.sh                # One-click installation
├── download_dataset.py     # Dataset download script
├── environment.yml         # Conda environment
├── requirements-cli.txt    # CLI dependencies
│
├── mac_cli/                # CLI implementation
│   ├── commands/           # CLI commands
│   ├── core/               # Core functionality
│   └── utils/              # Utilities
│
├── Config/                 # Configuration files
│   └── prompt_template/    # Prompt templates
│
├── Dataset/                # Dataset construction scripts
├── experiment/             # Experiment code
│   ├── method/             # CoVR implementations
│   └── understanding/      # Task implementations
│
├── utils/                  # Core utilities
├── MAC_Bench/              # Downloaded dataset
└── experiment/results/     # Experiment outputs
mac status --detailed  # Check what's wrong
Environment Problems:
conda env update -f environment.yml
Missing Dependencies:
pip install -r requirements-cli.txt
Dataset Download Issues:
python download_dataset.py  # Manual download
API Connection Problems:
mac status --check-apis --config config.yaml
# Check your API keys in config.yaml
The MAC_Bench dataset is available on 🤗 Hugging Face and contains:
- Source Journals: Nature, Science, Cell, ACS Central Science
- Cover Images: High-resolution scientific journal covers
- Cover Stories: Corresponding textual descriptions
- Task Variants: Image2Text and Text2Image understanding
- Size: 10,000+ samples across multiple journals
The dataset is automatically downloaded during setup, but you can also download it manually:
# Via Hugging Face
from datasets import load_dataset
dataset = load_dataset("mhjiang0408/MAC_Bench")
# Or via download script
python download_dataset.py
# More workers for faster processing
mac run --config config.yaml --workers 8
# Resume interrupted experiments
mac run --config config.yaml --resume
# Group results by journal
mac analyze results/ --group-by journal
# Generate only JSON reports
mac analyze results/ --format json --no-plot
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Submit a pull request
This project is licensed under the MIT License. See the LICENSE file for details.
If you use MAC_Bench in your research, please cite our paper:
@misc{jiang2025maclivebenchmarkmultimodal,
title={MAC: A Live Benchmark for Multimodal Large Language Models in Scientific Understanding},
author={Mohan Jiang and Jin Gao and Jiahao Zhan and Dequan Wang},
year={2025},
eprint={2508.15802},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2508.15802},
}