
ConsistEdit: Highly Consistent and Precise Training-free Visual Editing

Zixin Yin¹, Ling-Hao Chen²·³, Lionel Ni¹·⁴, Xili Dai⁴

¹HKUST, ²Tsinghua University, ³IDEA Research, ⁴HKUST(GZ)

✨ACM SIGGRAPH Asia 2025✨

🎯 Demo

ConsistEdit multi-step editing demo: side-by-side comparisons of source vs. edited videos and source vs. target images.

🔧 Setup

Requirements

pip install -r requirements.txt

Model Preparation

Download the required diffusion models:

  • Stable Diffusion 3: /path/to/stable-diffusion-3-medium-diffusers
  • FLUX.1-dev: /path/to/FLUX.1-dev
  • CogVideoX-2b: /path/to/CogVideoX-2b

Update the model paths in the scripts accordingly.
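
If you prefer to fetch the weights programmatically, a minimal sketch using huggingface_hub is shown below; the repo IDs are the standard public ones, and the local_dir values are placeholders you should adjust to match the paths above.

from huggingface_hub import snapshot_download

# Note: some of these repositories are gated on Hugging Face and may require
# accepting the license and running `huggingface-cli login` first.
snapshot_download(repo_id="stabilityai/stable-diffusion-3-medium-diffusers",
                  local_dir="/path/to/stable-diffusion-3-medium-diffusers")
snapshot_download(repo_id="black-forest-labs/FLUX.1-dev",
                  local_dir="/path/to/FLUX.1-dev")
snapshot_download(repo_id="THUDM/CogVideoX-2b",
                  local_dir="/path/to/CogVideoX-2b")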

🚀 Quick Start

Using Scripts

We provide two demonstration scripts in the script/ directory:

1. Consistent Editing (Change Color/Material)

bash script/sd3_consist_edit.sh

2. Inconsistent Editing (Change Style/Object)

bash script/sd3_inconsist_edit.sh

Manual Usage

Stable Diffusion 3

# Consistent edit (color change) at full consistency strength (alpha 1.0)
python run_synthesis_sd3.py \
    --src_prompt "a portrait of a woman in a red dress in a forest, best quality" \
    --tgt_prompt "a portrait of a woman in a yellow dress in a forest, best quality" \
    --edit_object "dress" \
    --out_dir "output" \
    --alpha 1.0 \
    --model_path "/path/to/stable-diffusion-3-medium-diffusers"

# Inconsistent edit (style change); the lower alpha (0.3) relaxes structural consistency
python run_synthesis_sd3.py \
    --src_prompt "a portrait of a woman in a red dress, realistic style, best quality" \
    --tgt_prompt "a portrait of a woman in a yellow dress, cartoon style, best quality" \
    --edit_object "dress" \
    --out_dir "output" \
    --alpha 0.3 \
    --model_path "/path/to/stable-diffusion-3-medium-diffusers"

FLUX

python run_synthesis_flux.py \
    --src_prompt "a portrait of a woman in a red dress in a forest, best quality" \
    --tgt_prompt "a portrait of a woman in a yellow dress in a forest, best quality" \
    --edit_object "dress" \
    --out_dir "output" \
    --alpha 1.0 \
    --model_path "/path/to/FLUX.1-dev"

CogVideo

python run_synthesis_cog.py \
    --src_prompt "a portrait of a woman in a red dress in a forest, best quality" \
    --tgt_prompt "a portrait of a woman in a yellow dress in a forest, best quality" \
    --edit_object "dress" \
    --out_dir "output" \
    --alpha 1.0 \
    --model_path "/path/to/CogVideoX-2b"

Real Image Editing

# Changes the hat color in assets/red_hat_girl.png from red to yellow
python run_real_sd3.py \
    --src_prompt "a girl with a red hat and red t-shirt is sitting in a park, best quality" \
    --tgt_prompt "a girl with a yellow hat and red t-shirt is sitting in a park, best quality" \
    --edit_object "hat" \
    --source_image_path "assets/red_hat_girl.png" \
    --out_dir "output" \
    --alpha 0.1 \
    --model_path "/path/to/stable-diffusion-3-medium-diffusers"

🎭 Masking Strategies

Three Approaches Explained

1. No Mask (--no_mask)

What it does: Disables masking and content fusion entirely.

Result: Colors in non-editing regions may change uncontrollably.

python run_synthesis_sd3.py --no_mask --alpha 0.3 ...

2. Old Mask Method (--use_old_mask)

What it does: The original implementation from the paper; computationally efficient, but the mask and the generated image are produced with mismatched attention implementations.

Technical Details:

  • Mask Calculation: Uses vanilla attention computation
  • Image Generation: Uses scaled dot-product attention computation

Result: ⚠️ Non-editing regions are preserved less faithfully due to the attention mismatch

3. New Mask Method (Default, use_old_mask=False)

What it does: Our improved method, which uses the same attention computation for masking and generation.

Technical Details:

  • Both Mask & Generation: Use vanilla attention computation

Result: ✅ Faithful background preservation, since mask computation and image generation are computationally aligned

Default (recommended): the new mask method is used automatically; no flag is required. A sketch contrasting the two attention computations is given below.
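
To make the mismatch concrete, here is a minimal, self-contained PyTorch sketch (not the repository's actual attention code): vanilla attention exposes the attention weights that mask extraction relies on, while torch.nn.functional.scaled_dot_product_attention is a fused kernel that returns only the output.

import math
import torch
import torch.nn.functional as F

def vanilla_attention(q, k, v):
    # Explicit softmax(QK^T / sqrt(d)) V; exposes the attention weights.
    weights = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(q.size(-1)), dim=-1)
    return weights @ v, weights

q, k, v = (torch.randn(1, 8, 64, 32) for _ in range(3))  # (batch, heads, tokens, head_dim)

out_vanilla, attn_weights = vanilla_attention(q, k, v)   # weights usable for masks
out_fused = F.scaled_dot_product_attention(q, k, v)      # fused kernel; no weights

# Mathematically equivalent, but fused kernels can differ numerically; computing
# the mask and the image with the same path (the default) removes that mismatch.
print(torch.allclose(out_vanilla, out_fused, atol=1e-5))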

⚙️ Parameters

Common Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| --src_prompt | str | Required | Source prompt: text description used to generate the source image; defines the initial state before editing. |
| --tgt_prompt | str | Required | Target prompt: text description of the desired edited result. |
| --edit_object | str | Required | Edit object: a word or short phrase that appears in src_prompt and names the object to edit; used for mask generation. |
| --out_dir | str | "output" | Output directory where generated images and masks are saved. |
| --alpha | float | 1.0 | Consistency strength (consistency_strength in the paper): strength of cross-attention injection, in [0.0, 1.0]. Use 1.0 for consistent edits (color/material) and lower values (e.g., 0.3) for inconsistent edits (style/object). |
| --model_path | str | Required | Local path to the diffusion model directory. |
| --no_mask | flag | False | Disable masking: no mask is generated and no content fusion is applied; useful for observing uncontrolled changes. |
| --use_old_mask | flag | False | Use the paper's original masking approach, which generates with scaled dot-product attention (less consistent; see Masking Strategies). |

Model-Specific Parameters

Real Image Editing (run_real_sd3.py)

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| --source_image_path | str | "assets/red_hat_girl.png" | Path to the input image to edit. |

📊 Evaluation

PIE-Bench Evaluation

To generate results for PIE-Bench evaluation:

python run_metric.py \
    --model_path "/path/to/stable-diffusion-3-medium-diffusers" \
    --data_path "/path/to/pie-bench-dataset"

This script processes the PIE-Bench dataset and generates edited images for quantitative evaluation.

Metric Calculation

To compute evaluation metrics:

python evaluate_sd3.py
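
For orientation, PIE-Bench-style background-preservation metrics compare the source and edited images outside the edit mask. The following is an illustrative sketch of one such metric (a masked PSNR), not the implementation in evaluation/matric_calculator.py.

import numpy as np

def masked_psnr(src, edited, mask):
    # src, edited: float arrays in [0, 1]; mask is 1 inside the edited region.
    keep = mask == 0                                  # background pixels only
    mse = np.mean((src[keep] - edited[keep]) ** 2)
    return float("inf") if mse == 0 else 10 * np.log10(1.0 / mse)

src = np.random.rand(512, 512, 3)
edited = src.copy()
mask = np.zeros((512, 512, 3))
mask[100:300, 100:300, :] = 1                         # pretend this region was edited
edited[mask == 1] += 0.05                             # perturb only the edited region
print(masked_psnr(src, edited, mask))                 # inf: background untouched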

📁 Project Structure

ConsistEdit_Code/
β”œβ”€β”€ run_synthesis_sd3.py      # SD3 synthesis editing
β”œβ”€β”€ run_synthesis_flux.py     # FLUX synthesis editing
β”œβ”€β”€ run_synthesis_cog.py      # CogVideo editing
β”œβ”€β”€ run_real_sd3.py          # Real image editing
β”œβ”€β”€ run_metric.py            # PIE-Bench evaluation script
β”œβ”€β”€ evaluate_sd3.py          # Metric calculation script
β”œβ”€β”€ demo_sd3_masking.ipynb   # Interactive demonstration
β”œβ”€β”€ script/
β”‚   β”œβ”€β”€ sd3_consist_edit.sh   # Consistent editing demo
β”‚   └── sd3_inconsist_edit.sh # Inconsistent editing demo
β”œβ”€β”€ consistEdit/
β”‚   β”œβ”€β”€ attention_control.py  # Cross-attention mechanisms
β”‚   β”œβ”€β”€ solver.py            # Diffusion solvers
β”‚   β”œβ”€β”€ utils.py             # Utility functions
β”‚   └── global_var.py        # Global variables
β”œβ”€β”€ evaluation/
β”‚   └── matric_calculator.py  # Evaluation metrics
└── assets/                   # Sample images

🙏 Acknowledgments

This codebase is built upon and inspired by several excellent open-source projects:

  • MasaCtrl: Tuning-Free Mutual Self-Attention Control for Consistent Image Synthesis and Editing
  • PnPInversion: Plug-and-Play diffusion features for text-driven image-to-image translation
  • UniEdit-Flow: Unleashing Inversion and Editing in the Era of Flow Models
  • DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation

We thank the authors of these works for their valuable contributions to the diffusion model editing community.

📖 Citation

If you find this work useful, please cite our paper:

@inproceedings{yin2025consistedit,
  title={ConsistEdit: Highly Consistent and Precise Training-free Visual Editing},
  author={Yin, Zixin and Chen, Ling-Hao and Ni, Lionel and Dai, Xili},
  booktitle={SIGGRAPH Asia 2025 Conference Papers},
  year={2025},
  publisher={ACM},
  doi={10.1145/3757377.3763909},
  address={Hong Kong, China},
  isbn={979-8-4007-2137-3/2025/12}
}
