WeEdit: A Dataset, Benchmark and Glyph-Guided Framework for Text-centric Image Editing
Hui Zhang1,2, Juntao Liu1, Zongkai Liu1,3, Liqiang Niu1, Fandong Meng1, Zuxuan Wu2, and Yu-Gang Jiang2
1WeChat AI, Tencent, 2Fudan University, 3Sun Yat-sen University
WeEdit is a systematic framework for text-centric image editing, addressing the challenges of modifying, translating, and rearranging textual elements embedded within images.
WeEdit Dataset 🗂️: A large-scale dataset of 330K text-centric editing pairs constructed via a novel HTML-based automatic pipeline, covering 7 editing operations and 15 languages.
WeEdit Benchmark 📊: Standardized bilingual (Chinese-English) and multilingual (15 languages) benchmarks with 2,000 test cases each, covering 8 editing operations (Add, Replace, Delete, Rearrange, Translate, Change Style, Combined, and Reasoning) for comprehensive evaluation.
Glyph-Guided SFT ✏️: A supervised fine-tuning stage that injects rendered glyph images as explicit spatial priors, enabling precise text placement and character-level fidelity.
Multi-Objective RL 🎯: A reinforcement learning stage with separate reward models targeting instruction adherence, text clarity, background preservation, and relative quality.
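The multi-objective stage above combines several reward signals into one training signal. A minimal sketch of such a combination, assuming a simple weighted sum (the reward names mirror the four objectives listed; the weights, function name, and linear aggregation are illustrative assumptions, not the released training code):

```python
# Hypothetical sketch: combining per-objective rewards into a single RL
# training signal. Equal weighting is an assumption for illustration.

def combine_rewards(rewards: dict, weights: dict = None) -> float:
    """Weighted sum of per-objective rewards, each assumed to lie in [0, 1]."""
    if weights is None:
        # Default to equal weighting across objectives (assumption).
        weights = {k: 1.0 / len(rewards) for k in rewards}
    return sum(weights[k] * rewards[k] for k in rewards)

sample = {
    "instruction_adherence": 0.9,
    "text_clarity": 0.8,
    "background_preservation": 0.95,
    "relative_quality": 0.7,
}
print(round(combine_rewards(sample), 4))  # 0.8375
```

In practice the weights would be tuned so that no single objective (e.g. background preservation) dominates the policy update.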
Our WeEdit dataset contains 330K high-quality text-centric image editing pairs constructed through two complementary pipelines:
- Structured Data (~170K): A novel HTML-based pipeline converts source images to HTML, extracts and edits text content via a VLM, and renders both source and target images through a headless browser, yielding pixel-perfect editing pairs.
- Unstructured Data (~160K): An automated edit-verify-and-retry pipeline operates directly at the image level for images with complex layouts, diverse typography, and text tightly entangled with complex visual backgrounds.
The dataset covers 7 editing operation types (Add, Replace, Delete, Rearrange, Translate, Change Style, Combined) and 15 languages (English, Chinese, Hindi, Spanish, French, Arabic, Portuguese, Bengali, Russian, German, Korean, Japanese, Thai, Indonesian, Vietnamese).
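The core idea of the HTML-based pipeline is that an edit applied to the HTML source yields perfectly aligned before/after images once both versions are rendered. A minimal sketch of the pairing step for a Replace operation, assuming a plain text substitution on the HTML string (the element markup and helper name are illustrative; headless-browser rendering to pixels is omitted):

```python
# Hypothetical sketch of HTML-based pair construction: swap the edited text
# span in the source HTML to obtain the target HTML. Rendering both strings
# with a headless browser would yield a pixel-aligned editing pair.

SOURCE_HTML = '<div id="headline" style="font-size:48px">SUMMER SALE</div>'

def make_edit_pair(source_html: str, old_text: str, new_text: str):
    """Return (source_html, target_html) for a Replace operation."""
    assert old_text in source_html, "edit target must exist in the source"
    return source_html, source_html.replace(old_text, new_text)

src, tgt = make_edit_pair(SOURCE_HTML, "SUMMER SALE", "WINTER SALE")
print(tgt)  # <div id="headline" style="font-size:48px">WINTER SALE</div>
```

Because everything outside the edited span is byte-identical, the rendered backgrounds match exactly, which is what makes the pairs "pixel-perfect".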
Our comprehensive benchmark evaluates text-centric image editing capabilities across multiple dimensions:
- Bilingual Benchmark: 2,000 test cases covering Chinese and English
- Multilingual Benchmark: 2,000 test cases spanning 15 languages
- 8 Task Categories: Add, Replace, Delete, Rearrange, Translate, Change Style, Combined, and Reasoning
- 3 Evaluation Dimensions: Instruction Adherence (IA), Text Clarity (TC), and Background Preservation (BP)
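One simple way to summarize the three dimensions into a single per-sample score is an unweighted mean; this aggregation is an assumption for illustration, not necessarily how the benchmark reports results:

```python
# Sketch (assumption): per-sample overall score as the unweighted mean of the
# three 0-9 dimension scores (IA, TC, BP).

def overall(ia: int, tc: int, bp: int) -> float:
    for s in (ia, tc, bp):
        assert 0 <= s <= 9, "scores are on a 0-9 scale"
    return (ia + tc + bp) / 3

print(overall(8, 7, 9))  # 8.0
```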
To evaluate a model's text-centric image editing capabilities on our benchmark:
1. Generate edited images and save them to a results directory with a `generated_imgs/` subfolder. Each image should be named `{img_id}_{instruction_type}.png`, where `img_id` and `instruction_type` come from the corresponding benchmark item.
2. Implement your own Gemini-3-Pro API call in `evaluation/evaluation_benchmark.py` by filling in the `call_gemini()` function.
3. Run the evaluation script.

Evaluate on the Bilingual Benchmark:

```shell
python evaluation/evaluation_benchmark.py \
    --results_dir <path_to_results> \
    --benchmark_file benchmark/Bilingual_benchmark.jsonl
```

Evaluate on the Multilingual Benchmark:

```shell
python evaluation/evaluation_benchmark.py \
    --results_dir <path_to_results> \
    --benchmark_file benchmark/Multilingual_benchmark.jsonl
```

The evaluation uses Gemini-3-Pro as an impartial VLM judge to score edited images on Instruction Adherence, Text Clarity, and Background Preservation, each on a 0-9 scale.
WeEdit achieves the best performance among open-source models on both benchmarks, surpassing most proprietary models and ranking second only to Gemini-3-Pro-Image.
If you find our work useful for your research and applications, please cite it using this BibTeX:
@article{zhang2026weedit,
title={WeEdit: A Dataset, Benchmark and Glyph-Guided Framework for Text-centric Image Editing},
author={Zhang, Hui and Liu, Juntao and Liu, Zongkai and Niu, Liqiang and Meng, Fandong and Wu, Zuxuan and Jiang, Yu-Gang},
journal={arXiv preprint arXiv:2603.11593},
year={2026}
}