Efficient MindSpore implementation of OmniGen2: a unified multimodal image generation and editing framework supporting text-to-image, instruction-guided image editing, and in-context generation.
OmniGen2 is a powerful and efficient generative model. Unlike OmniGen v1, OmniGen2 features two distinct decoding pathways for text and image modalities, utilizing unshared parameters and a decoupled image tokenizer. OmniGen2 has competitive performance across four primary capabilities:
- Visual Understanding: Inherits the robust ability to interpret and analyze image content from its Qwen-VL-2.5 foundation.
- Text-to-Image Generation: Creates high-fidelity and aesthetically pleasing images from textual prompts.
- Instruction-guided Image Editing: Executes complex, instruction-based image modifications with high precision, achieving state-of-the-art performance among open-source models.
- In-context Generation: A versatile capability to process and flexibly combine diverse inputs—including humans, reference objects, and scenes—to produce novel and coherent visual outputs.
- MindSpore inference pipeline and Gradio demo are available under `examples/omnigen2/`.
- Example presets are provided via `configs/app.yaml` and support URL-based images.
| mindspore | ascend driver | cann |
|---|---|---|
| >=2.6.0 | >=24.1.RC1 | >=8.1.RC1 |
- Install MindSpore and Ascend software per the official docs.
- Install Python dependencies:

```shell
cd examples/omnigen2
pip install -r requirements.txt
```
Text-to-Image
Image Editing
| Prompt | Input | Output |
|---|---|---|
| Raise his hand | ![]() | ![]() |
| Convert this image into Ghibli style | ![]() | ![]() |
| Change the dress to blue. | ![]() | ![]() |
| Generate an anime-style figurine based on the original image | ![]() | ![]() |
| Make him smile | ![]() | ![]() |
In-context Generation
Visual Understanding
OmniGen2 weights and assets are hosted on Hugging Face:

```shell
hf download OmniGen2/OmniGen2 --exclude "assets/*"
```

Tip: For users in Mainland China, set the `HF_ENDPOINT=https://hf-mirror.com` environment variable.
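For example, a minimal sketch of downloading through the mirror (assuming the mirror endpoint is reachable from your network):

```shell
# Route Hugging Face downloads through the mirror, then fetch the weights
export HF_ENDPOINT=https://hf-mirror.com
hf download OmniGen2/OmniGen2 --exclude "assets/*"
```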
Usage examples are available under scripts/run. These scripts provide ready-to-use commands for common inference
scenarios and demonstrate various OmniGen2 capabilities.
For the full list of flags, see `python scripts/inference.py --help`.
- For TeaCache (~30% speedup at the default threshold), add the following flags: `--enable_teacache --teacache_rel_l1_thresh 0.05`
- For TaylorSeer (up to ~2× speedup, mutually exclusive with TeaCache), add: `--enable_taylorseer`

A local demo UI is available at `app.py`:

```shell
pip install gradio
python app.py
```

To achieve optimal results with OmniGen2, you can adjust the following key hyperparameters based on your specific use case.
- Guidance scales:
  - `text_guidance_scale`: stronger adherence to text (default 5.0).
  - `image_guidance_scale`: stronger adherence to input images (edit/in-context). Try 1.2–2.0 for editing; 2.5–3.0 for in-context generation.
- Scheduler: `euler` (default) or `dpmsolver++` for potentially fewer steps at similar quality.
- CFG range: lowering `--cfg_range_end` can reduce latency with minor quality impact.
- Prompts: be specific. English prompts currently work best; longer, descriptive prompts often help.
- Inputs: prefer clear images ≥ 512×512.
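As a rough sketch of combining these options, the guidance-scale and scheduler flag names below are assumed from the hyperparameter names above and all values are illustrative; check `python scripts/inference.py --help` for the exact interface:

```shell
# Sketch only: prompt, input-image, and model-path arguments are omitted;
# see the ready-to-use commands under scripts/run for complete examples.
python scripts/inference.py \
    --text_guidance_scale 5.0 \
    --image_guidance_scale 1.5 \
    --scheduler euler \
    --cfg_range_end 0.9 \
    --enable_teacache --teacache_rel_l1_thresh 0.05
```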
Before launching the training, you need to prepare the following configuration files.
The first is the training configuration: a YAML file that specifies crucial parameters for your training job, including the model architecture, optimizer, dataset paths, and validation settings.
We provide two templates to get you started:
- Full-Parameter Fine-Tuning: `configs/finetune/ft.yml`
- LoRA Fine-Tuning: `configs/finetune/ft_lora.yml`
Copy one of these templates and modify it according to your needs. Below are some of the most important parameters you may want to adjust:
- `name`: The experiment name. This is used to create a directory for logs and saved model weights (e.g., `experiments/your_exp_name`).
- `data.config_path`: Path to the data configuration file that defines your training data sources and mixing ratios.
- `data.max_output_pixels`: The maximum number of pixels for an output image. Larger images will be downsampled while maintaining their aspect ratio.
- `data.max_input_pixels`: A list specifying the maximum pixel count for input images, corresponding to one, two, three, or more inputs.
- `data.max_side_length`: The maximum side length for any image (input or output). Images exceeding this will be downsampled while maintaining their aspect ratio.
- `dataloader.batch_size`: The batch size per NPU.
- `train.steps`: The total number of training steps to run.
- `train.lr_scheduler.lr`: The learning rate for the optimizer. Note: this often requires tuning based on your dataset size and whether you are using LoRA. We recommend using a lower learning rate for full-parameter fine-tuning.
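As a minimal sketch of how these keys might be laid out, with nesting inferred from the dotted parameter names above and all values used as placeholders (start from the provided templates rather than this excerpt):

```yaml
name: your_exp_name                   # logs and weights go to experiments/your_exp_name
data:
  config_path: configs/finetune/data/mix.yml
  max_output_pixels: 1048576          # e.g. 1024 x 1024
  max_input_pixels: [1048576, 518400, 518400]  # limits for 1, 2, 3+ input images
  max_side_length: 2048
dataloader:
  batch_size: 1                       # per NPU
train:
  steps: 10000
  lr_scheduler:
    lr: 1.0e-5                        # use a lower value for full-parameter fine-tuning
```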
The data configuration consists of a set of YAML and JSONL files.
- The `.yml` file defines the mixing ratios for different data sources.
- The `.jsonl` files contain the actual data entries, with each line representing a single data sample.
For a practical example, please refer to `configs/finetune/data/mix.yml`.
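For orientation only, a hypothetical mixing configuration could pair each `.jsonl` source with a relative sampling ratio; the key names below (`datasets`, `path`, `ratio`) are illustrative and not the actual schema, so consult `configs/finetune/data/mix.yml` for the real format:

```yaml
# Hypothetical structure: each entry points to a .jsonl data source
# and a relative sampling ratio. Key names are illustrative only.
datasets:
  - path: /path/to/your/data/edit/samples.jsonl
    ratio: 0.7
  - path: /path/to/your/data/t2i/samples.jsonl
    ratio: 0.3
```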
Each line in a `.jsonl` file describes a sample, generally following this format:

```json
{
    "task_type": "edit",
    "instruction": "add a hat to the person",
    "input_images": [
        "/path/to/your/data/edit/input1.png",
        "/path/to/your/data/edit/input2.png"
    ],
    "output_image": "/path/to/your/data/edit/output.png"
}
```

Note: The `input_images` field can be omitted for text-to-image (T2I) tasks.
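A text-to-image sample then reduces to an instruction and an output path. The `"t2i"` value below is an assumption (only `"edit"` appears in the example above), so match whatever task names your data pipeline expects:

```json
{
    "task_type": "t2i",
    "instruction": "a watercolor painting of a lighthouse at sunset",
    "output_image": "/path/to/your/data/t2i/output.png"
}
```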
Once your configuration is ready, you can launch the training script. All experiment artifacts, including logs and
checkpoints, will be saved in `experiments/${experiment_name}`.
We provide convenient shell scripts to handle the complexities of launching distributed training jobs. You can use them directly or adapt them for your environment.
- For Full-Parameter Fine-Tuning: `scripts/run/ft.sh`
- For LoRA Fine-Tuning: `scripts/run/ft_lora.sh`
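For example, assuming the scripts are invoked from `examples/omnigen2` and any paths inside them have been adapted to your environment:

```shell
# Full-parameter fine-tuning
bash scripts/run/ft.sh

# LoRA fine-tuning
bash scripts/run/ft_lora.sh
```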
⚠️ Note on LoRA Checkpoints: Currently, when training with LoRA, the script saves the entire model's parameters (including the frozen base model weights) in the checkpoint. This is due to a limitation in easily extracting only the LoRA-related parameters when using FSDP.
| Model | Mode | Cards | Precision | Number of input images | Resolution | Scheduler | Steps | s/img |
|---|---|---|---|---|---|---|---|---|
| OmniGen2 | Text-to-Image | 1 | BF16 | - | 1024x1024 | Euler | 50 | 120 |
| OmniGen2 | Image Editing | 1 | BF16 | 1 | 832x1248 | Euler | 50 | 282 |
| OmniGen2 | In-context Generation | 1 | BF16 | 1 | 768x1152 | Euler | 50 | 248 |
| OmniGen2 | In-context Generation | 1 | BF16 | 2 | 1024x1024 | Euler | 50 | 870 |
| Model | Fine-tuning | Cards | Batch size | Resolution | Precision | s/step | Recipe |
|---|---|---|---|---|---|---|---|
| OmniGen2 | Full | 8 | 1 | 720x720 | BF16 | 5.03 | ft.yml |
| OmniGen2 | LoRA | 8 | 1 | 720x720 | BF16 | 3.78 | ft_lora.yml |
If you find OmniGen2 useful, please cite the original work:
```bibtex
@article{wu2025omnigen2,
  title={OmniGen2: Exploration to Advanced Multimodal Generation},
  author={Chenyuan Wu and Pengfei Zheng and Ruiran Yan and Shitao Xiao and Xin Luo and Yueze Wang and Wanli Li and Xiyan Jiang and Yexin Liu and Junjie Zhou and Ze Liu and Ziyi Xia and Chaofan Li and Haoge Deng and Jiahao Wang and Kun Luo and Bo Zhang and Defu Lian and Xinlong Wang and Zhongyuan Wang and Tiejun Huang and Zheng Liu},
  journal={arXiv preprint arXiv:2506.18871},
  year={2025}
}
```

























