[diffusion] model: support JoyAI-Image-Edit #22625
mickqian merged 13 commits into sgl-project:main
Conversation
Code Review
This pull request introduces support for the JoyImage model, adding the DiT architecture, Qwen3-VL text encoder, and associated pipeline and sampling configurations. The implementation includes new configuration classes, runtime model definitions, and registration within the multimodal generation framework. Review feedback highlights several improvement opportunities, including refactoring the VAE configuration to follow standard dataclass patterns, moving stateful bucket initialization to prevent side effects during request processing, fixing a typo in a prompt template, and resolving variable shadowing in the encoder implementation to enhance maintainability.
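For reference, the "standard dataclass pattern" the review alludes to for the VAE config would look roughly like this (a minimal sketch; apart from `get_vae_scale_factor()` and `post_init()` mentioned in this PR, the field names and values are illustrative assumptions):

```python
# Minimal sketch of the standard-dataclass pattern; only
# get_vae_scale_factor() is named in this PR, other fields are illustrative.
from dataclasses import dataclass, field


@dataclass
class WanVAEConfig:
    # Hypothetical downsampling factors of the VAE (illustrative values).
    spatial_downsample: int = 8
    temporal_downsample: int = 4
    # Derived once in __post_init__ instead of being recomputed at call sites.
    vae_scale_factor: int = field(init=False)

    def __post_init__(self) -> None:
        self.vae_scale_factor = self.spatial_downsample

    def get_vae_scale_factor(self) -> int:
        # Unified accessor for downstream scale-factor usage.
        return self.vae_scale_factor
```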
```python
)

prompt_template_encode = (
    "<|im_start|>system\n \\nDescribe the image by detailing the color, shape, size,"
```
The \\n in the prompt template string appears to be a typo. It will result in a literal backslash followed by 'n' in the prompt, which is likely not intended. It should probably be a single \n or removed if the preceding newline is sufficient.
| "<|im_start|>system\n \\nDescribe the image by detailing the color, shape, size," | |
| "<|im_start|>system\nDescribe the image by detailing the color, shape, size," |
```python
)
image_index, video_index = 0, 0
attention_mask = attention_mask.to(total_input_ids.device)
for i, input_ids in enumerate(total_input_ids):
```
```python
config = config.arch_config
self.model = Qwen3VLModel(config)
self.lm_head = nn.Linear(
    config.text_config.hidden_size, config.text_config.vocab_size, bias=False
)
```
Shadowing the config argument with config.arch_config makes the code harder to maintain and reason about. It is better to use a distinct variable name like arch_config to avoid confusion between the Qwen3VLConfig and Qwen3VLArchConfig objects.
Suggested change:

```diff
-config = config.arch_config
-self.model = Qwen3VLModel(config)
-self.lm_head = nn.Linear(
-    config.text_config.hidden_size, config.text_config.vocab_size, bias=False
-)
+arch_config = config.arch_config
+self.model = Qwen3VLModel(arch_config)
+self.lm_head = nn.Linear(
+    arch_config.text_config.hidden_size, arch_config.text_config.vocab_size, bias=False
+)
```
Hi @mickqian, friendly ping. This PR is now ready for review when you have time. Thanks!
```python
    # encoder hidden state
    prompt_embeds = qwen_image_postprocess_text(outputs, image_inputs, 64)
    return prompt_embeds

def encoding_image_edit(self, outputs, image_inputs, pipeline_config):
```
This seems to regress the existing Qwen-Image-Edit path. Previously image edit called qwen_image_postprocess_text(..., drop_idx=64), but this generic call falls back to the default drop_idx=34. That changes the conditioning tokens for existing Qwen image-edit models. Could we keep an edit-specific postprocess wrapper or make the drop index configurable per pipeline?
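One possible shape for that, as a hedged sketch (the `text_drop_idx` config attribute is an assumption; `qwen_image_postprocess_text` and the 34/64 values come from this thread):

```python
# Sketch: make the drop index a per-pipeline setting instead of a
# hard-coded default. `text_drop_idx` is a hypothetical config attribute;
# 64 is the edit-specific value discussed above, 34 the generic default.
def encoding_image_edit(self, outputs, image_inputs, pipeline_config):
    drop_idx = getattr(pipeline_config, "text_drop_idx", 64)
    prompt_embeds = qwen_image_postprocess_text(outputs, image_inputs, drop_idx)
    return prompt_embeds
```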
Thanks for your careful review; I fixed it by restoring the edit-specific behavior: Qwen image-edit now uses drop_idx=64 again.
Fixed in 01b1e61
```python
def post_denoising_loop(self, latents, batch):
    lt, lh, lw = batch.vae_image_sizes[0]
    target_len = lt * lh * lw
    target_patches = latents[0, :target_len]
```
This always selects latents[0], so batch requests or num_outputs_per_prompt > 1 will silently drop every output except the first one. Could this preserve the batch dimension, e.g. slice latents[:, :target_len] and rearrange with a leading b dimension, or explicitly reject batch sizes > 1 if unsupported?
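A batch-preserving variant could look like this (a sketch only; the latent layout `(b, seq_len, dim)` and the `rearrange` pattern are assumptions inferred from the snippet above):

```python
from einops import rearrange

def post_denoising_loop(self, latents, batch):
    # Assumed layout: latents is (b, seq_len, dim); keep the batch axis
    # instead of indexing latents[0].
    lt, lh, lw = batch.vae_image_sizes[0]
    target_len = lt * lh * lw
    target_patches = latents[:, :target_len]
    # Hypothetical unpatchify with a leading b dimension; the exact pattern
    # depends on the model's patch layout.
    return rearrange(target_patches, "b (t h w) c -> b c t h w", t=lt, h=lh, w=lw)
```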
Great catch — fixed in e9400a9.
This update is not only in post_denoising_loop: it also aligns condition latents to batch.batch_size in postprocess_image_latent, and aligns encoder_hidden_states/mask batch in JoyTransformer3DModel with strict mismatch checks.
So batched requests (num_outputs_per_prompt > 1) are handled consistently end-to-end.
BBuf left a comment:

Please resolve the above comments.
Force-pushed from 9487adc to e9400a9.
@BBuf Thanks again for your comments — I’ve addressed both of them and replied in each thread. When you have a moment, could you please take another look and continue the review/approve if everything looks good? Really appreciate it 🙏
Could you add an example to the PR showing your output alongside diffusers output, so we can verify correctness? We also need a CI test case for this. Please refer to the existing CI tests and add one in the same style.
Thanks for the suggestion — I’ve added an example comparison between SGLang and diffusers outputs using the same input/settings as a correctness check.
I’m now adding a CI test case in the existing diffusion CI style and will push it in a follow-up commit shortly.
/tag-and-rerun-ci
Added a JoyAI image-edit 1-GPU diffusion CI case following the existing config-driven test style. It currently runs as a smoke generation case with perf/consistency checks disabled.
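For readers unfamiliar with that style, a smoke case boils down to something like the following (purely illustrative pytest sketch; the fixture and engine API are invented here, only the model repo name is real):

```python
# Purely illustrative smoke test; `diffusion_engine` is a hypothetical
# fixture standing in for SGLang's config-driven diffusion CI harness.
import pytest

MODEL = "jdopensource/JoyAI-Image-Edit-Diffusers"

def test_joyai_image_edit_smoke(diffusion_engine):
    engine = diffusion_engine(MODEL, num_gpus=1)
    out = engine.edit_image(
        image="tests/assets/example.png",  # hypothetical asset path
        prompt="make the sky a sunset orange",
        num_inference_steps=4,             # few steps: generation smoke only
    )
    assert out is not None  # perf/consistency checks intentionally disabled
```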
Update: I learned from the Diffusers PR author that the upstream weight names may still change before huggingface/diffusers#13444 is merged. Since this SGLang integration includes weight-loading mappings, I think it may be safer to align this PR with the final upstream weight format before merging. I’ll keep the CI case and implementation ready, but may update the loader mappings once the Diffusers PR/model repo format is finalized.
You can turn on the perf flag, run it once, and then paste the perf data into perf_baseline.json. |
I enabled the perf check, but no perf baseline was produced, because the failure happens before generation: importing the custom Joy transformer fails. Could you help rerun/check the multimodal-gen 1-gpu CI environment? After the case actually runs, I’ll paste the generated baseline into perf_baseline.json.
are we ready with this PR?
mickqian left a comment:
could you update the doc if you have time?
Yes, I believe we are ready now.
Sure. Do you mean adding JoyAI-Image-Edit to the diffusion documentation / supported model list, or is there a specific doc page you want updated?
Co-authored-by: chengyusong1 <chengyusong1@jd.com>



Motivation
We are the JoyAI Team. This PR adds SGLang multimodal generation support for JoyAI-Image-Edit to enable image-editing inference with the JoyAI architecture in the existing diffusion pipeline framework. JoyAI-Image is a unified multimodal foundation model for image understanding, text-to-image generation, and instruction-guided image editing. It combines an 8B Multimodal Large Language Model (MLLM) with a 16B Multimodal Diffusion Transformer (MMDiT).
Model: https://huggingface.co/jdopensource/JoyAI-Image-Edit-Diffusers
Modifications
- New `JoyImageEditPipelineConfig` + `JoyImageEditSamplingParams`.
- New `JoyImageArchConfig` / `JoyImageDiTConfig`.
- New `JoyTransformer3DModel` and Joy image-edit transformer path.
- `Qwen3VLConfig` text encoder wired into `JoyImageEditPipelineConfig`.
- `JoyImageEditPipeline` with standard TI2I stage composition.
- Joy text postprocess registered in `postprocess_text_funcs`.
- Default sampling: `guidance_scale=4.0`, `num_inference_steps=40`, `num_frames=1`, empty negative prompt.
- `enable_sequence_shard` for Joy pipelines.
- `ImagePipelineConfig.shard_latents_for_sp`: bypass additional latent sharding when sequence shard is already enabled.
- `WanVAEConfig`: `get_vae_scale_factor()` and `post_init()` fields for unified downstream scale-factor usage.

Accuracy Tests
N/A for this PR.
This change mainly integrates a new model/pipeline into the framework and does not modify existing model forward logic or kernel behavior.
Speed Tests and Profiling
N/A for this PR.
No speed regression/performance optimization is claimed in this change.
Checklist
Review and Merge Process
/tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci