
WeEdit


WeEdit: A Dataset, Benchmark and Glyph-Guided Framework for Text-centric Image Editing
Hui Zhang1,2, Juntao Liu1, Zongkai Liu1,3, Liqiang Niu1, Fandong Meng1, Zuxuan Wu2, and Yu-Gang Jiang2
1WeChat AI, Tencent, 2Fudan University, 3Sun Yat-sen University

Introduction

WeEdit is a systematic framework for text-centric image editing, addressing the challenges of modifying, translating, and rearranging textual elements embedded within images.

WeEdit Dataset 🗂️: A large-scale dataset of 330K text-centric editing pairs constructed via a novel HTML-based automatic pipeline, covering 7 editing operations and 15 languages.

WeEdit Benchmark 📊: Standardized bilingual (Chinese-English) and multilingual (15 languages) benchmarks with 2,000 test cases each, covering 8 editing operations (Add, Replace, Delete, Rearrange, Translate, Change Style, Combined, and Reasoning) for comprehensive evaluation.

Glyph-Guided SFT ✏️: A supervised fine-tuning stage that injects rendered glyph images as explicit spatial priors, enabling precise text placement and character-level fidelity.

Multi-Objective RL 🎯: A reinforcement learning stage with separate reward models targeting instruction adherence, text clarity, background preservation, and relative quality.

Dataset and Benchmark

WeEdit Dataset

Our WeEdit dataset contains 330K high-quality text-centric image editing pairs constructed through two complementary pipelines:

  • Structured Data (~170K): A novel HTML-based pipeline converts source images to HTML, extracts and edits text content via a VLM, and renders both source and target images through a headless browser, yielding pixel-perfect editing pairs.
  • Unstructured Data (~160K): An automated edit-verify-and-retry pipeline operates directly at the image level for images with complex layouts, diverse typography, and text tightly entangled with complex visual backgrounds.

The dataset covers 7 editing operation types (Add, Replace, Delete, Rearrange, Translate, Change Style, Combined) and 15 languages (English, Chinese, Hindi, Spanish, French, Arabic, Portuguese, Bengali, Russian, German, Korean, Japanese, Thai, Indonesian, Vietnamese).
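The README does not specify the on-disk schema of the released pairs. As a rough illustration only, one editing pair could be stored as a JSONL record like the following; every field name and value here is a hypothetical assumption, not the actual dataset layout:

```python
import json

# Hypothetical schema for one WeEdit editing pair; the field names
# in the actual release may differ.
record = {
    "pair_id": "struct_000001",
    "pipeline": "structured",    # "structured" (HTML-based) or "unstructured"
    "operation": "Replace",      # one of the 7 editing operation types
    "language": "English",       # one of the 15 covered languages
    "instruction": "Replace the word 'SALE' with 'OPEN'.",
    "source_image": "images/struct_000001_src.png",
    "target_image": "images/struct_000001_tgt.png",
}

# Serialize as one JSONL line and round-trip it.
line = json.dumps(record, ensure_ascii=False)
decoded = json.loads(line)
```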

WeEdit Benchmark

Our comprehensive benchmark evaluates text-centric image editing capabilities across multiple dimensions:

  • Bilingual Benchmark: 2,000 test cases covering Chinese and English
  • Multilingual Benchmark: 2,000 test cases spanning 15 languages
  • 8 Task Categories: Add, Replace, Delete, Rearrange, Translate, Change Style, Combined, and Reasoning
  • 3 Evaluation Dimensions: Instruction Adherence (IA), Text Clarity (TC), and Background Preservation (BP)

Evaluation

To evaluate a model's text-centric image editing capabilities on our benchmark:

  1. Generate edited images and save them to a results directory containing a generated_imgs/ subfolder. Name each image {img_id}_{instruction_type}.png, where img_id and instruction_type come from the corresponding benchmark item.

  2. Implement your own Gemini-3-Pro API call in evaluation/evaluation_benchmark.py by filling in the call_gemini() function.

  3. Run the evaluation script:

Evaluate on the Bilingual Benchmark:

python evaluation/evaluation_benchmark.py \
    --results_dir <path_to_results> \
    --benchmark_file benchmark/Bilingual_benchmark.jsonl

Evaluate on the Multilingual Benchmark:

python evaluation/evaluation_benchmark.py \
    --results_dir <path_to_results> \
    --benchmark_file benchmark/Multilingual_benchmark.jsonl
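The naming convention from step 1 can be sketched as a small helper; the `generated_imgs/` layout and the `{img_id}_{instruction_type}.png` pattern are from the README above, while the sample values are made up:

```python
from pathlib import Path

def result_path(results_dir: str, img_id: str, instruction_type: str) -> Path:
    """Build the expected path of an edited image:
    <results_dir>/generated_imgs/{img_id}_{instruction_type}.png
    """
    return Path(results_dir) / "generated_imgs" / f"{img_id}_{instruction_type}.png"

# Example with made-up benchmark values.
p = result_path("results", "0042", "Replace")
```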

The evaluation uses Gemini-3-Pro as an impartial VLM judge to score edited images across Instruction Adherence, Text Clarity, and Background Preservation on a 0-9 scale.
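How per-item judge scores are aggregated is not detailed here; a minimal averaging sketch over the three stated dimensions, where the per-item dict layout (`IA`, `TC`, `BP` keys holding 0-9 scores) is an assumption:

```python
def aggregate(scores):
    """Average 0-9 judge scores per dimension across benchmark items.

    `scores` is a list of dicts with hypothetical keys "IA" (Instruction
    Adherence), "TC" (Text Clarity), and "BP" (Background Preservation).
    """
    dims = ("IA", "TC", "BP")
    return {d: sum(item[d] for item in scores) / len(scores) for d in dims}

# Two made-up benchmark items.
avg = aggregate([
    {"IA": 8, "TC": 7, "BP": 9},
    {"IA": 6, "TC": 9, "BP": 7},
])
```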

Main Results

Bilingual Benchmark

Multilingual Benchmark

WeEdit achieves the best performance among open-source models on both benchmarks, surpassing most proprietary models and ranking second only to Gemini-3-Pro-Image.

Citation

If you find our work useful for your research and applications, please cite it using this BibTeX:

@article{zhang2026weedit,
  title={WeEdit: A Dataset, Benchmark and Glyph-Guided Framework for Text-centric Image Editing},
  author={Zhang, Hui and Liu, Juntao and Liu, Zongkai and Niu, Liqiang and Meng, Fandong and Wu, Zuxuan and Jiang, Yu-Gang},
  journal={arXiv preprint arXiv:2603.11593},
  year={2026}
}
