This repository is the official implementation of WISE.
Current default: WISE_Verified. If you need the original GPT-4o based WISE release, see WISE_legacy.
- 2026/04/19: We release WISE_Verified, a maintenance update for easier and lower-cost evaluation. It uses a vLLM-served Qwen3.5-35B-A3B judge, refreshes about 200 prompts, changes WiScore into a binary 0/1 score focused on world-knowledge consistency and realism, and updates the leaderboard with 21 models, including NanoBanana-Pro, GPT-Image-1.5, QwenImage, FLUX.2, BAGEL, and HunyuanImage.
- 2025/06/03: We updated the original code to provide clearer, simpler, and easier evaluation.
- 2025/05/24: We collected feedback and updated the original code. If you have any questions or comments, feel free to email us at niuyuwei04@gmail.com.
- 2025/03/11: We released our paper at https://arxiv.org/abs/2503.07265.
- 2025/03/10: We released the original code and data.
Text-to-Image (T2I) models can generate high-quality artistic creations and visual content. However, existing research and evaluation standards often focus on image realism and shallow text-image alignment, while lacking a comprehensive assessment of complex semantic understanding and world knowledge integration in text-to-image generation.
WISE is a benchmark for World Knowledge-Informed Semantic Evaluation. It moves beyond simple word-pixel mapping by challenging models with 1,000 prompts across cultural common sense, spatio-temporal reasoning, and natural science.
WISE_Verified is not WISE 2.0. It is a practical update of the original benchmark so that users can evaluate models with an open-source judge more conveniently, especially if GPT-4o-2024-05-13 becomes unavailable or too costly for large-scale evaluation.
WISE_Verified keeps the original goal of measuring world-knowledge consistency, but changes the default evaluation protocol:
- Open-source judge: We use Qwen3.5-35B-A3B through a vLLM OpenAI-compatible endpoint for evaluation.
- Verified prompts: About 200 WISE prompts were updated. Some original prompts were too easy, while others could trigger closed-source model policy restrictions during generation.
- Binary WiScore: WISE_Verified changes WiScore into a binary 0/1 score. We no longer separately score realism or aesthetic quality; each image is judged by whether it correctly realizes the prompt's world-knowledge meaning and is realistic and visually usable for evaluation.
- Updated leaderboard: We evaluated 21 models, including NanoBanana-Pro, GPT-Image-1.5, QwenImage, FLUX.2, BAGEL, and HunyuanImage. Some closed-source models or compute-heavy models are still missing because they do not provide usable APIs or exceed our current compute budget. We welcome model authors and users to contact us if they can provide results.
- data_verified/cultural_common_sense_verified.json: verified cultural common sense prompts, IDs 1-400.
- data_verified/spatio-temporal_reasoning_verified.json: verified time and space prompts, IDs 401-640.
- data_verified/natural_science_verified.json: verified biology, physics, and chemistry prompts, IDs 641-1000.
- data_verified/merge.json: optional merged copy of all 1,000 verified prompts.
- vllm_eval.py: evaluator for Qwen3.5-35B-A3B served by vLLM.
- calculate_verified.py: WISE_Verified score calculation script.
- eval_qwen.sh: end-to-end evaluation template.
- leadboard.md: full WISE_Verified leaderboard.
- WISE_legacy: archived original WISE release with GPT-4o evaluation, original data, original code, and assets.
Prepare generated images in one directory. The file names must match the prompt IDs:
IMAGE_DIR="/path/to/generated_images" # contains 1.png, 2.png, ..., 1000.pngStart a vLLM server that exposes an OpenAI-compatible chat completion endpoint for Qwen3.5-35B-A3B. See the official vLLM repository for installation and serving instructions. For example:
vllm serve /path/to/Qwen3.5-35B-A3B \
--served-model-name Qwen3.5-35B-A3B \
--host 0.0.0.0 \
--port 8000Then run the evaluation:
export IMAGE_DIR="/path/to/generated_images"
export VLLM_API_BASE="http://127.0.0.1:8000/v1"
export VLLM_API_KEY="EMPTY"
export JUDGE_MODEL="Qwen3.5-35B-A3B"
export MAX_WORKERS=96
bash eval_qwen.shThe script writes category-level outputs to ${IMAGE_DIR}/Results-qwen35/ and then calls calculate_verified.py to report the category scores and overall WISE_Verified score.
You can also run the scoring step manually after evaluation:
python calculate_verified.py \
"${IMAGE_DIR}/Results-qwen35/cultural_common_sense_scores_results.jsonl" \
"${IMAGE_DIR}/Results-qwen35/natural_science_scores_results.jsonl" \
"${IMAGE_DIR}/Results-qwen35/spatio-temporal_reasoning_scores_results.jsonl" \
--category allvllm_eval.py produces one binary score for each image:
Score: 1: the image is correct according to the prompt and explanation, reflects the intended world knowledge, and is realistic and visually usable for judging.Score: 0: the image misses the intended knowledge-based answer, has incorrect key relations, is unrealistic or too ambiguous to verify, or has generation failures that interfere with evaluation.
The per-sample WiScore is now this binary 0/1 score. The overall WISE_Verified score uses the following category weights:
Overall = 0.40 * CULTURE
+ 0.12 * TIME
+ 0.12 * SPACE
+ 0.12 * BIOLOGY
+ 0.12 * PHYSICS
+ 0.12 * CHEMISTRY
The full WISE_Verified leaderboard is available in leadboard.md.
| Rank | Model | Overall | CULTURE | TIME | SPACE | BIOLOGY | PHYSICS | CHEMISTRY |
|---|---|---|---|---|---|---|---|---|
| 1 | NanoBanana-Pro | 0.8760 | 0.8975 | 0.8167 | 0.9333 | 0.8167 | 0.8667 | 0.8750 |
| 2 | GPT-Image-1.5 | 0.8250 | 0.8900 | 0.6917 | 0.8833 | 0.8000 | 0.7583 | 0.7750 |
| 3 | BAGEL (w/ CoT) | 0.6280 | 0.7800 | 0.6333 | 0.5667 | 0.3750 | 0.5500 | 0.5083 |
| 4 | FLUX.2-dev | 0.5650 | 0.6650 | 0.5667 | 0.6583 | 0.3667 | 0.5250 | 0.3750 |
| 5 | QwenImage | 0.5100 | 0.6275 | 0.5250 | 0.5583 | 0.3417 | 0.4833 | 0.2500 |
| 6 | Qwen-Image-2512 | 0.4990 | 0.5950 | 0.4750 | 0.6000 | 0.3500 | 0.4917 | 0.2583 |
| 7 | Z-Image | 0.4530 | 0.5475 | 0.4667 | 0.5083 | 0.3250 | 0.4750 | 0.1750 |
| 8 | FLUX.2-klein-9B | 0.4400 | 0.4900 | 0.3917 | 0.5500 | 0.3833 | 0.4833 | 0.2250 |
| 9 | HunyuanImage-3.0 | 0.4350 | 0.5250 | 0.3917 | 0.4833 | 0.3083 | 0.4500 | 0.2417 |
| 10 | FLUX.1-dev | 0.4160 | 0.5225 | 0.4000 | 0.5333 | 0.1750 | 0.3750 | 0.2417 |
The original WISE release used GPT-4o-2024-05-13 to score consistency, realism, and aesthetic quality, then computed the original WiScore. That version is archived in WISE_legacy, including the original README, code, data, and figures.
Use the legacy version if you need to reproduce the original paper setting or compare against the old GPT-4o based scores.
@article{niu2025wise,
title={WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation},
author={Niu, Yuwei and Ning, Munan and Zheng, Mengren and Jin, Weiyang and Lin, Bin and Jin, Peng and Liao, Jiaqi and Ning, Kunpeng and Feng, Chaoran and Zhu, Bin and Yuan, Li},
journal={arXiv preprint arXiv:2503.07265},
year={2025}
}If you have questions, comments, or model results to add to the leaderboard, please contact Yuwei Niu at niuyuwei04@gmail.com.
If you are interested in unified multimodal models, Purshow/Awesome-Unified-Multimodal is a comprehensive resource for papers, code, and other materials.
