
[diffusion] model: support JoyAI-Image-Edit #22625

Merged
mickqian merged 13 commits into sgl-project:main from lahmuller:joyimage-edit
May 2, 2026

Conversation

lahmuller (Contributor) commented Apr 12, 2026:

Motivation

We are the JoyAI Team. This PR adds SGLang multimodal generation support for JoyAI-Image-Edit, enabling image-editing inference with the JoyAI architecture in the existing diffusion pipeline framework. JoyAI-Image is a unified multimodal foundation model for image understanding, text-to-image generation, and instruction-guided image editing. It combines an 8B Multimodal Large Language Model (MLLM) with a 16B Multimodal Diffusion Transformer (MMDiT).
Model: https://huggingface.co/jdopensource/JoyAI-Image-Edit-Diffusers

Modifications

  • Added JoyAI model registration in the multimodal registry:
    • New HF model path detection and config mapping for JoyImageEditPipelineConfig + JoyImageEditSamplingParams.
  • Added JoyImage DiT config and runtime model implementation:
    • JoyImageArchConfig / JoyImageDiTConfig
    • Runtime JoyTransformer3DModel and Joy image-edit transformer path.
  • Added Qwen3-VL text encoder support for JoyImage:
    • New encoder config Qwen3VLConfig
    • Inference runtime implementation for Qwen3-VL text/vision-conditioned encoding.
  • Added JoyImage image-edit pipeline config and runtime pipeline:
    • JoyImageEditPipelineConfig
    • JoyImageEditPipeline with standard TI2I stage composition.
  • Generalized image-edit text postprocessing in the image encoding stage:
    • Replaced the Qwen-image-specific postprocess path with pipeline-configured postprocess_text_funcs.
  • Added JoyImage sampling defaults (see the sketch after this list):
    • guidance_scale=4.0, num_inference_steps=40, num_frames=1, empty negative prompt.
  • Added Joy-related runtime compatibility updates:
    • Auto-enable enable_sequence_shard for Joy pipelines.
    • In ImagePipelineConfig.shard_latents_for_sp, bypass additional latent sharding when sequence shard is already enabled.
    • Added WanVAEConfig.get_vae_scale_factor() and post_init() fields for unified downstream scale-factor usage.
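
For reference, a minimal sketch of how the sampling defaults above could be expressed; the field names follow this PR's description, but the actual class in the PR may inherit from a shared base and define additional fields:

```python
from dataclasses import dataclass


@dataclass
class JoyImageEditSamplingParams:
    # Defaults taken from the PR description; illustration only.
    guidance_scale: float = 4.0
    num_inference_steps: int = 40
    num_frames: int = 1          # single-frame output for image editing
    negative_prompt: str = ""    # empty negative prompt by default


params = JoyImageEditSamplingParams()
print(params.guidance_scale, params.num_inference_steps)  # 4.0 40
```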

Accuracy Tests

N/A for this PR.
This change mainly integrates a new model/pipeline into the framework and does not modify existing model forward logic or kernel behavior.

Speed Tests and Profiling

N/A for this PR.
No speed regression or performance optimization is claimed in this change.

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

gemini-code-assist (Bot) left a comment:

Code Review

This pull request introduces support for the JoyImage model, adding the DiT architecture, Qwen3-VL text encoder, and associated pipeline and sampling configurations. The implementation includes new configuration classes, runtime model definitions, and registration within the multimodal generation framework. Review feedback highlights several improvement opportunities, including refactoring the VAE configuration to follow standard dataclass patterns, moving stateful bucket initialization to prevent side effects during request processing, fixing a typo in a prompt template, and resolving variable shadowing in the encoder implementation to enhance maintainability.

Comment thread on python/sglang/multimodal_gen/configs/models/vaes/wanvae.py (outdated)
Comment thread on python/sglang/multimodal_gen/configs/pipeline_configs/joy_image.py (outdated)
)

prompt_template_encode = (
"<|im_start|>system\n \\nDescribe the image by detailing the color, shape, size,"
gemini-code-assist (Bot) commented (severity: medium):

The \\n in the prompt template string appears to be a typo. It will result in a literal backslash followed by 'n' in the prompt, which is likely not intended. It should probably be a single \n or removed if the preceding newline is sufficient.

Suggested change
- "<|im_start|>system\n \\nDescribe the image by detailing the color, shape, size,"
+ "<|im_start|>system\nDescribe the image by detailing the color, shape, size,"

)
image_index, video_index = 0, 0
attention_mask = attention_mask.to(total_input_ids.device)
for i, input_ids in enumerate(total_input_ids):
gemini-code-assist (Bot) commented (severity: medium):

The loop variable input_ids shadows the function argument input_ids. This is confusing and can lead to bugs, especially since the argument's device is accessed later in the function (e.g., at line 563). Using a distinct name like ids or input_id_seq would be safer.
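
A toy illustration of the suggested rename; the function and data below are hypothetical stand-ins, not the PR's actual code:

```python
import torch


def nonzero_counts(total_input_ids, input_ids):
    counts = []
    # `ids` instead of `input_ids`: the argument stays visible, so its
    # device can still be read after the loop.
    for i, ids in enumerate(total_input_ids):
        counts.append(int((ids != 0).sum()))
    return torch.tensor(counts, device=input_ids.device)


total = [torch.tensor([1, 2, 0]), torch.tensor([3, 0, 0])]
print(nonzero_counts(total, torch.tensor([1, 2, 3])))  # tensor([2, 1])
```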

Comment on lines +903 to +907
config = config.arch_config
self.model = Qwen3VLModel(config)
self.lm_head = nn.Linear(
    config.text_config.hidden_size, config.text_config.vocab_size, bias=False
)
gemini-code-assist (Bot) commented (severity: medium):

Shadowing the config argument with config.arch_config makes the code harder to maintain and reason about. It is better to use a distinct variable name like arch_config to avoid confusion between the Qwen3VLConfig and Qwen3VLArchConfig objects.

Suggested change
- config = config.arch_config
- self.model = Qwen3VLModel(config)
- self.lm_head = nn.Linear(
-     config.text_config.hidden_size, config.text_config.vocab_size, bias=False
- )
+ arch_config = config.arch_config
+ self.model = Qwen3VLModel(arch_config)
+ self.lm_head = nn.Linear(
+     arch_config.text_config.hidden_size, arch_config.text_config.vocab_size, bias=False
+ )

chengyusong1 added 2 commits April 12, 2026 15:35
lahmuller (Contributor, Author) commented:

Hi @mickqian, friendly ping. This PR is now ready for review when you have time. Thanks!

        # encoder hidden state
        prompt_embeds = qwen_image_postprocess_text(outputs, image_inputs, 64)
        return prompt_embeds

    def encoding_image_edit(self, outputs, image_inputs, pipeline_config):
A Collaborator commented:

This seems to regress the existing Qwen-Image-Edit path. Previously image edit called qwen_image_postprocess_text(..., drop_idx=64), but this generic call falls back to the default drop_idx=34. That changes the conditioning tokens for existing Qwen image-edit models. Could we keep an edit-specific postprocess wrapper or make the drop index configurable per pipeline?

lahmuller (Contributor, Author) replied:

Thanks for your careful review. I fixed this by restoring the edit-specific behavior: Qwen image-edit now uses drop_idx=64 again.
Fixed in 01b1e61
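
For context, a sketch of how an edit-specific drop index can be bound per pipeline with functools.partial; the postprocess function below is a simplified stand-in, not the PR's actual implementation:

```python
from functools import partial


def qwen_image_postprocess_text(outputs, image_inputs, drop_idx=34):
    # Stand-in: drop the leading template tokens before the conditioning
    # sequence; the real function operates on encoder hidden states.
    return outputs[drop_idx:]


# Edit pipelines bind drop_idx=64; plain text-to-image keeps the default 34.
edit_postprocess = partial(qwen_image_postprocess_text, drop_idx=64)
print(edit_postprocess(list(range(100)), None)[:3])  # [64, 65, 66]
```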

def post_denoising_loop(self, latents, batch):
    lt, lh, lw = batch.vae_image_sizes[0]
    target_len = lt * lh * lw
    target_patches = latents[0, :target_len]
A Collaborator commented:

This always selects latents[0], so batch requests or num_outputs_per_prompt > 1 will silently drop every output except the first one. Could this preserve the batch dimension, e.g. slice latents[:, :target_len] and rearrange with a leading b dimension, or explicitly reject batch sizes > 1 if unsupported?

lahmuller (Contributor, Author) replied:

Great catch — fixed in e9400a9.

The fix is not limited to post_denoising_loop: it also aligns the condition latents to batch.batch_size in postprocess_image_latent, and aligns the encoder_hidden_states/mask batch dimension in JoyTransformer3DModel with strict mismatch checks.

So batched requests (num_outputs_per_prompt > 1) are handled consistently end-to-end.
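
For reference, a self-contained sketch of the batch-preserving slicing discussed in this thread; all shapes below are hypothetical:

```python
import torch
from einops import rearrange

b, seq, c = 2, 1024, 16    # batch, packed token length, channels
lt, lh, lw = 1, 16, 32     # latent time/height/width (lt * lh * lw = 512)
latents = torch.randn(b, seq, c)

target_len = lt * lh * lw
# Slice every sample in the batch instead of only latents[0].
target_patches = latents[:, :target_len]
# Unpack tokens back into a spatial latent grid, keeping the batch dim.
latents_5d = rearrange(target_patches, "b (t h w) c -> b c t h w", t=lt, h=lh, w=lw)
print(latents_5d.shape)  # torch.Size([2, 16, 1, 16, 32])
```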

BBuf (Collaborator) requested changes Apr 17, 2026:

Please resolve the comments above.

@lahmuller lahmuller requested a review from BBuf April 20, 2026 04:14
lahmuller (Contributor, Author) commented:

@BBuf Thanks again for your comments — I’ve addressed both of them and replied in each thread. When you have a moment, could you please take another look and continue the review/approve if everything looks good? Really appreciate it 🙏

Prozac614 (Contributor) commented Apr 23, 2026:

Could you add an example to the PR showing your output alongside diffusers output, so we can verify correctness?

We also need a CI test case for this. Please refer to the existing CI tests and add one in the same style.

lahmuller (Contributor, Author) commented Apr 24, 2026:

> Could you add an example to the PR showing your output alongside diffusers output, so we can verify correctness?
>
> We also need a CI test case for this. Please refer to the existing CI tests and add one in the same style.

Thanks for the suggestion. I've added an example comparing SGLang and diffusers outputs under the same input and settings as a correctness check.
The diffusers output was generated by following the usage instructions on the official JoyAI model page: jdopensource/JoyAI-Image-Edit-Diffusers.
The SGLang output was generated with the command below:

```bash
sglang generate \
  --model-path jdopensource/JoyAI-Image-Edit-Diffusers \
  --num-gpus 1 \
  --sp-degree 1 \
  --prompt "Remove the construction structure from the top of the crane." \
  --image-path "${input_image_path}" \
  --output-path "${output_dir}" \
  --output-file-name "${output_file_name}" \
  --save-output \
  --seed 0 \
  --guidance-scale 4.0 \
  --num-inference-steps 30
```

| input image | prompt | diffusers | this PR |
| --- | --- | --- | --- |
| (image) | Remove the construction structure from the top of the crane. | (image) | (image) |

I’m now adding a CI test case in the existing diffusion CI style, and will push it in a follow-up commit shortly.

mickqian (Collaborator) commented:

/tag-and-rerun-ci

lahmuller (Contributor, Author) commented:

Added a JoyAI image-edit 1-GPU diffusion CI case following the existing config-driven test style. It currently runs as a smoke generation case with perf/consistency checks disabled.

lahmuller (Contributor, Author) commented Apr 24, 2026:

Update: I learned from the Diffusers PR author that the upstream weight names may still change before huggingface/diffusers#13444 is merged. Since this SGLang integration includes weight-loading mappings, I think it may be safer to align this PR with the final upstream weight format before merging.

I’ll keep the CI case and implementation ready, but may update the loader mappings once the Diffusers PR/model repo format is finalized.
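
To illustrate why upstream renames matter here, a generic sketch of the kind of key-remapping table such loaders use; the mapping below is hypothetical, not this PR's actual table:

```python
# Hypothetical old-prefix -> new-prefix rename; the real mapping would
# track whatever the finalized Diffusers checkpoint format uses.
WEIGHT_KEY_MAP = {
    "transformer_blocks.": "blocks.",
}


def remap_state_dict(state_dict):
    remapped = {}
    for name, tensor in state_dict.items():
        for old, new in WEIGHT_KEY_MAP.items():
            if name.startswith(old):
                name = new + name[len(old):]
                break
        remapped[name] = tensor
    return remapped


print(remap_state_dict({"transformer_blocks.0.attn.weight": 0}))
# {'blocks.0.attn.weight': 0}
```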

Prozac614 (Contributor) replied:

> Added a JoyAI image-edit 1-GPU diffusion CI case following the existing config-driven test style. It currently runs as a smoke generation case with perf/consistency checks disabled.

You can turn on the perf flag, run it once, and then paste the perf data into perf_baseline.json.

lahmuller (Contributor, Author) replied:

> > Added a JoyAI image-edit 1-GPU diffusion CI case following the existing config-driven test style. It currently runs as a smoke generation case with perf/consistency checks disabled.
>
> You can turn on the perf flag, run it once, and then paste the perf data into perf_baseline.json.

I enabled the perf check, but no perf baseline was produced because joyai_image_edit_ti2i was skipped.

The failure occurs before generation: importing the custom Joy transformer fails with ModuleNotFoundError: sglang.multimodal_gen.runtime.utils.layerwise_offload, and the fallback to diffusers then fails because JoyImageEditTransformer3DModel does not exist in diffusers. The file exists in the PR head, so this looks like a CI checkout/install/environment issue rather than a perf-baseline issue.

Could you help rerun/check the multimodal-gen 1-gpu CI environment? After the case actually runs, I’ll paste the generated baseline into perf_baselines.json.

mickqian (Collaborator) commented May 2, 2026:

are we ready with this PR?

mickqian (Collaborator) left a review comment:

could you update the doc if you have time?

lahmuller (Contributor, Author) replied:

> are we ready with this PR?

Yes, I believe we are ready now.

> could you update the doc if you have time?

Sure. Do you mean adding JoyAI-Image-Edit to the diffusion documentation / supported model list, or is there a specific doc page you want updated?

mickqian merged commit 5ec3b26 into sgl-project:main on May 2, 2026
70 of 78 checks passed
vguduruTT pushed a commit to vguduruTT/sglang that referenced this pull request May 2, 2026
Co-authored-by: chengyusong1 <chengyusong1@jd.com>

Labels

diffusion SGLang Diffusion run-ci
