OmniGen2 (MindSpore)

Efficient MindSpore implementation of OmniGen2: a unified multimodal image generation and editing framework supporting text-to-image, instruction-guided image editing, and in-context generation.

Overview

OmniGen2 is a powerful and efficient generative model. Unlike OmniGen v1, OmniGen2 features two distinct decoding pathways for text and image modalities, utilizing unshared parameters and a decoupled image tokenizer. OmniGen2 has competitive performance across four primary capabilities:

  • Visual Understanding: Inherits the robust ability to interpret and analyze image content from its Qwen2.5-VL foundation.
  • Text-to-Image Generation: Creates high-fidelity and aesthetically pleasing images from textual prompts.
  • Instruction-guided Image Editing: Executes complex, instruction-based image modifications with high precision, achieving state-of-the-art performance among open-source models.
  • In-context Generation: A versatile capability to process and flexibly combine diverse inputs—including humans, reference objects, and scenes—to produce novel and coherent visual outputs.

News

  • MindSpore inference pipeline and Gradio demo are available under examples/omnigen2/.
  • Example presets are provided via configs/app.yaml and support URL-based images.

📦 Requirements

| mindspore | ascend driver | cann      |
|-----------|---------------|-----------|
| >=2.6.0   | >=24.1.RC1    | >=8.1.RC1 |
  1. Install MindSpore and the Ascend software stack (driver and CANN) following the official installation docs.

  2. Install Python dependencies:

    cd examples/omnigen2
    pip install -r requirements.txt
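
After installing, you can verify that MindSpore and the Ascend backend are set up correctly using MindSpore's built-in self-check:

    import mindspore
    mindspore.run_check()  # prints the installed version and runs a small op on the device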

Demo

Text-to-Image

Sample 1024x1024 generations (demo images omitted); prompts:

  • The sun rises slightly, the dew on the rose petals in the garden is clear, a crystal ladybug is crawling to the dew, the background is the early morning garden, macro lens.
  • A snow maiden with pale translucent skin, frosty white lashes, and a soft expression of longing.
  • This dreamlike digital art captures a vibrant, kaleidoscopic bird in a lush rainforest.
  • A cat holds a white board writing text "OmniGen2" and a red heart.
Image Editing
Prompt Input Output
Raise his hand
Convert this image into
Ghibli style
Change the dress to blue.
Generate an anime-style
figurine based on
the original image
Make him smile
In-context Generation

Example prompts combining one or two input images (images omitted):

  • In a cozy café, the anime figure is sitting in front of a laptop, smiling confidently. (single input image)
  • Let the girl and the boy get married in the church.
  • The two people shown in the images are sitting in a theater, watching the screen. One person points at the other person.
  • Replace the woman in the second image with the woman from the first image.
Visual Understanding

Example image descriptions (input images omitted):

  • Prompt: Please briefly describe this image.
    Output: The image is a close-up of an anime character with light-colored hair and blue eyes. The character is wearing a green suit with a red tie and a white shirt. The background appears to be an indoor setting, possibly a room with wooden paneling. The character's expression is neutral or slightly serious.
  • Prompt: Please describe this image briefly.
    Output: The image shows a plush toy bear sitting on a grassy surface. The bear has a brown body with white paws and a white muzzle. It is wearing a blue bow on its head and a white bib with the text "Get Well" written on it. The background consists of green grass with some small plants and clover.

Model Weights

OmniGen2 weights and assets are hosted on Hugging Face.

hf download OmniGen2/OmniGen2 --exclude "assets/*"

Tip

For users in Mainland China, set the environment variable HF_ENDPOINT=https://hf-mirror.com before downloading.
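
The same download can also be done from Python with huggingface_hub (a minimal sketch; the local_dir destination is a placeholder):

    from huggingface_hub import snapshot_download

    # Mirrors the `hf download` command above: fetch the OmniGen2 weights
    # while skipping the large demo assets.
    snapshot_download(
        repo_id="OmniGen2/OmniGen2",
        ignore_patterns=["assets/*"],
        local_dir="./OmniGen2",  # placeholder destination directory
    )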

Inference

Command Line

Usage examples are available under scripts/run. These scripts provide ready-to-use commands for common inference scenarios and demonstrate various OmniGen2 capabilities.

For the full list of flags, see python scripts/inference.py --help.

Speedup Inference with Caching

  • TeaCache (~30% speedup at the default threshold; see the sketch below): add --enable_teacache --teacache_rel_l1_thresh 0.05
  • TaylorSeer (up to ~2× speedup; mutually exclusive with TeaCache): add --enable_taylorseer
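
For intuition, TeaCache-style caching reuses the previous transformer output whenever the step-to-step change of the model input is small relative to the threshold. The following is a toy sketch of that gating rule, not the repository's implementation:

    import numpy as np

    def should_reuse_cache(prev_inp, curr_inp, rel_l1_thresh=0.05):
        # Relative L1 distance between the model inputs of two consecutive
        # denoising steps; if it is below the threshold, the cached output
        # is reused instead of re-running the transformer.
        rel_l1 = np.abs(curr_inp - prev_inp).mean() / (np.abs(prev_inp).mean() + 1e-8)
        return rel_l1 < rel_l1_thresh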

Gradio App

A local Gradio demo UI is provided in app.py.

pip install gradio
python app.py

Usage Tips

To achieve optimal results with OmniGen2, you can adjust the following key hyperparameters based on your specific use case.

  • Guidance scales (see the sketch after this list)
    • text_guidance_scale: higher values enforce stronger adherence to the text prompt (default 5.0).
    • image_guidance_scale: higher values enforce stronger adherence to the input images (editing/in-context). Try 1.2–2.0 for editing and 2.5–3.0 for in-context generation.
  • Scheduler: euler (default), or dpmsolver++ for potentially fewer steps at similar quality.
  • CFG range: lowering --cfg_range_end can reduce latency with minor quality impact.
  • Prompts: be specific. English prompts currently work best; longer, descriptive prompts often help.
  • Inputs: prefer clear images of at least 512×512.
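
As a rough illustration of how these knobs fit together in code, the sketch below assumes a diffusers-style pipeline wrapper; the import path and all argument names except the two guidance scales are assumptions, so check scripts/inference.py for the actual API.

    # Hypothetical sketch only; see scripts/inference.py for the real entry point.
    from omnigen2.pipelines import OmniGen2Pipeline  # import path is an assumption

    pipe = OmniGen2Pipeline.from_pretrained("OmniGen2/OmniGen2")
    image = pipe(
        prompt="Change the dress to blue.",
        input_images=["input.png"],   # editing task: one reference image (name assumed)
        text_guidance_scale=5.0,      # default; adherence to the text prompt
        image_guidance_scale=1.5,     # 1.2-2.0 suggested for editing
        num_inference_steps=50,       # argument name assumed
    )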

Training

1. Preparation

Before launching the training, you need to prepare the following configuration files.

Step 1: Set Up the Training Configuration

This is a YAML file that specifies crucial parameters for your training job, including the model architecture, optimizer, dataset paths, and validation settings.

We provide two templates to get you started:

  • Full-Parameter Fine-Tuning: configs/finetune/ft.yml
  • LoRA Fine-Tuning: configs/finetune/ft_lora.yml

Copy one of these templates and modify it according to your needs. Below are some of the most important parameters you may want to adjust:

  • name: The experiment name. This is used to create a directory for logs and saved model weights (e.g., experiments/your_exp_name).
  • data.config_path: Path to the data configuration file that defines your training data sources and mixing ratios.
  • data.max_output_pixels: The maximum number of pixels for an output image. Larger images will be downsampled while maintaining their aspect ratio.
  • data.max_input_pixels: A list specifying the maximum pixel count for input images, corresponding to one, two, three, or more inputs.
  • data.max_side_length: The maximum side length for any image (input or output). Images exceeding this will be downsampled while maintaining their aspect ratio.
  • dataloader.batch_size: The batch size per NPU.
  • train.steps: The total number of training steps to run.
  • train.lr_scheduler.lr: The learning rate for the optimizer. Note: this often requires tuning based on your dataset size and on whether you are using LoRA; we recommend a lower learning rate for full-parameter fine-tuning than for LoRA (a config-override sketch follows this list).
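
For quick experiments, a template can be loaded, overridden, and re-saved programmatically. A sketch using the dotted key paths listed above (the experiment name and output path are placeholders):

    import yaml

    with open("configs/finetune/ft_lora.yml") as f:
        cfg = yaml.safe_load(f)

    cfg["name"] = "my_lora_exp"                # artifacts go to experiments/my_lora_exp
    cfg["dataloader"]["batch_size"] = 1        # batch size per NPU
    cfg["train"]["steps"] = 2000               # total training steps
    cfg["train"]["lr_scheduler"]["lr"] = 1e-4  # tune for your dataset and LoRA/full choice

    with open("configs/finetune/my_lora_exp.yml", "w") as f:
        yaml.safe_dump(cfg, f)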

Step 2: Configure Your Dataset

The data configuration consists of a YAML file and a set of JSONL files.

  • The .yml file defines the mixing ratios for different data sources.
  • The .jsonl files contain the actual data entries, with each line representing a single data sample.

For a practical example, please refer to configs/finetune/data/mix.yml. Each line in a .jsonl file describes a sample, generally following this format:

{
  "task_type": "edit",
  "instruction": "add a hat to the person",
  "input_images": [
    "/path/to/your/data/edit/input1.png",
    "/path/to/your/data/edit/input2.png"
  ],
  "output_image": "/path/to/your/data/edit/output.png"
}

Note: The input_images field can be omitted for text-to-image (T2I) tasks.
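
A small helper can emit such a file, one JSON object per line (a sketch; the paths are placeholders, and the "t2i" task_type value is an assumption):

    import json

    samples = [
        {   # editing sample with a reference input image
            "task_type": "edit",
            "instruction": "add a hat to the person",
            "input_images": ["/path/to/your/data/edit/input1.png"],
            "output_image": "/path/to/your/data/edit/output.png",
        },
        {   # text-to-image sample: input_images omitted ("t2i" value assumed)
            "task_type": "t2i",
            "instruction": "a crystal ladybug on a dew-covered rose petal",
            "output_image": "/path/to/your/data/t2i/output.png",
        },
    ]

    with open("train.jsonl", "w") as f:
        for sample in samples:
            f.write(json.dumps(sample) + "\n")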

2. 🚀 Launching the Training

Once your configuration is ready, you can launch the training script. All experiment artifacts, including logs and checkpoints, will be saved in experiments/${experiment_name}.

We provide convenient shell scripts to handle the complexities of launching distributed training jobs. You can use them directly or adapt them for your environment.

  • For Full-Parameter Fine-Tuning: scripts/run/ft.sh
  • For LoRA Fine-Tuning: scripts/run/ft_lora.sh

⚠️ Note on LoRA Checkpoints: Currently, when training with LoRA, the script saves the entire model's parameters (including the frozen base model weights) in the checkpoint. This is due to a limitation in easily extracting only the LoRA-related parameters when using FSDP.
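
If you only need the adapter weights downstream, one workaround is to filter the saved checkpoint by parameter name. A sketch assuming the LoRA parameters carry "lora" in their names (the checkpoint path is a placeholder):

    import mindspore as ms

    # Load the full checkpoint saved by the LoRA run, keep only the
    # parameters named as LoRA adapters, and re-save the smaller file.
    params = ms.load_checkpoint("experiments/your_exp_name/checkpoint.ckpt")
    lora_only = [{"name": k, "data": v} for k, v in params.items() if "lora" in k.lower()]
    ms.save_checkpoint(lora_only, "lora_only.ckpt")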

Performance

Inference

| Model    | Mode                  | Cards | Precision | Input images | Resolution | Scheduler | Steps | s/img |
|----------|-----------------------|-------|-----------|--------------|------------|-----------|-------|-------|
| OmniGen2 | Text-to-Image         | 1     | BF16      | -            | 1024x1024  | Euler     | 50    | 120   |
| OmniGen2 | Image Editing         | 1     | BF16      | 1            | 832x1248   | Euler     | 50    | 282   |
| OmniGen2 | In-context Generation | 1     | BF16      | 1            | 768x1152   | Euler     | 50    | 248   |
| OmniGen2 | In-context Generation | 1     | BF16      | 2            | 1024x1024  | Euler     | 50    | 870   |

Training

| Model    | Fine-tuning | Cards | Batch size | Resolution | Precision | s/step | Recipe      |
|----------|-------------|-------|------------|------------|-----------|--------|-------------|
| OmniGen2 | Full        | 8     | 1          | 720x720    | BF16      | 5.03   | ft.yml      |
| OmniGen2 | LoRA        | 8     | 1          | 720x720    | BF16      | 3.78   | ft_lora.yml |

Acknowledgement

If you find OmniGen2 useful, please cite the original work:

@article{wu2025omnigen2,
  title={OmniGen2: Exploration to Advanced Multimodal Generation},
  author={Chenyuan Wu and Pengfei Zheng and Ruiran Yan and Shitao Xiao and Xin Luo and Yueze Wang and Wanli Li and Xiyan Jiang and Yexin Liu and Junjie Zhou and Ze Liu and Ziyi Xia and Chaofan Li and Haoge Deng and Jiahao Wang and Kun Luo and Bo Zhang and Defu Lian and Xinlong Wang and Zhongyuan Wang and Tiejun Huang and Zheng Liu},
  journal={arXiv preprint arXiv:2506.18871},
  year={2025}
}