[diffusion] model: support JoyAI-Image-Edit #22625
mickqian merged 13 commits into sgl-project:main
Conversation
Code Review
This pull request introduces support for the JoyImage model, adding the DiT architecture, Qwen3-VL text encoder, and associated pipeline and sampling configurations. The implementation includes new configuration classes, runtime model definitions, and registration within the multimodal generation framework. Review feedback highlights several improvement opportunities, including refactoring the VAE configuration to follow standard dataclass patterns, moving stateful bucket initialization to prevent side effects during request processing, fixing a typo in a prompt template, and resolving variable shadowing in the encoder implementation to enhance maintainability.
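For reference, the "standard dataclass pattern" the review alludes to for the VAE config would look roughly like this (a minimal sketch; apart from `get_vae_scale_factor()` and `post_init()` mentioned in this PR, the field names and values are illustrative assumptions):

```python
# Minimal sketch of the standard-dataclass pattern; only
# get_vae_scale_factor() is named in this PR, other fields are illustrative.
from dataclasses import dataclass, field


@dataclass
class WanVAEConfig:
    # Hypothetical downsampling factors of the VAE (illustrative values).
    spatial_downsample: int = 8
    temporal_downsample: int = 4
    # Derived once in __post_init__ instead of being recomputed at call sites.
    vae_scale_factor: int = field(init=False)

    def __post_init__(self) -> None:
        self.vae_scale_factor = self.spatial_downsample

    def get_vae_scale_factor(self) -> int:
        # Unified accessor for downstream scale-factor usage.
        return self.vae_scale_factor
```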
```python
)

prompt_template_encode = (
    "<|im_start|>system\n \\nDescribe the image by detailing the color, shape, size,"
```
The \\n in the prompt template string appears to be a typo. It will result in a literal backslash followed by 'n' in the prompt, which is likely not intended. It should probably be a single \n or removed if the preceding newline is sufficient.
| "<|im_start|>system\n \\nDescribe the image by detailing the color, shape, size," | |
| "<|im_start|>system\nDescribe the image by detailing the color, shape, size," |
```python
)
image_index, video_index = 0, 0
attention_mask = attention_mask.to(total_input_ids.device)
for i, input_ids in enumerate(total_input_ids):
```
```python
config = config.arch_config
self.model = Qwen3VLModel(config)
self.lm_head = nn.Linear(
    config.text_config.hidden_size, config.text_config.vocab_size, bias=False
)
```
Shadowing the config argument with config.arch_config makes the code harder to maintain and reason about. It is better to use a distinct variable name like arch_config to avoid confusion between the Qwen3VLConfig and Qwen3VLArchConfig objects.
Suggested change:

```diff
-config = config.arch_config
-self.model = Qwen3VLModel(config)
-self.lm_head = nn.Linear(
-    config.text_config.hidden_size, config.text_config.vocab_size, bias=False
-)
+arch_config = config.arch_config
+self.model = Qwen3VLModel(arch_config)
+self.lm_head = nn.Linear(
+    arch_config.text_config.hidden_size, arch_config.text_config.vocab_size, bias=False
+)
```
Hi @mickqian, friendly ping. This PR is now ready for review when you have time. Thanks!
```python
    # encoder hidden state
    prompt_embeds = qwen_image_postprocess_text(outputs, image_inputs, 64)
    return prompt_embeds

def encoding_image_edit(self, outputs, image_inputs, pipeline_config):
```
This seems to regress the existing Qwen-Image-Edit path. Previously image edit called qwen_image_postprocess_text(..., drop_idx=64), but this generic call falls back to the default drop_idx=34. That changes the conditioning tokens for existing Qwen image-edit models. Could we keep an edit-specific postprocess wrapper or make the drop index configurable per pipeline?
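One possible shape for that, as a hedged sketch (the `text_drop_idx` config attribute is an assumption; `qwen_image_postprocess_text` and the 34/64 values come from this thread):

```python
# Sketch: make the drop index a per-pipeline setting instead of a
# hard-coded default. `text_drop_idx` is a hypothetical config attribute;
# 64 is the edit-specific value discussed above, 34 the generic default.
def encoding_image_edit(self, outputs, image_inputs, pipeline_config):
    drop_idx = getattr(pipeline_config, "text_drop_idx", 64)
    prompt_embeds = qwen_image_postprocess_text(outputs, image_inputs, drop_idx)
    return prompt_embeds
```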
Thanks for your careful review; I fixed it by restoring the edit-specific behavior: Qwen image-edit now uses drop_idx=64 again.
Fixed in 01b1e61
```python
def post_denoising_loop(self, latents, batch):
    lt, lh, lw = batch.vae_image_sizes[0]
    target_len = lt * lh * lw
    target_patches = latents[0, :target_len]
```
This always selects latents[0], so batch requests or num_outputs_per_prompt > 1 will silently drop every output except the first one. Could this preserve the batch dimension, e.g. slice latents[:, :target_len] and rearrange with a leading b dimension, or explicitly reject batch sizes > 1 if unsupported?
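A batch-preserving variant could look like this (a sketch only; the latent layout `(b, seq_len, dim)` and the `rearrange` pattern are assumptions inferred from the snippet above):

```python
from einops import rearrange

def post_denoising_loop(self, latents, batch):
    # Assumed layout: latents is (b, seq_len, dim); keep the batch axis
    # instead of indexing latents[0].
    lt, lh, lw = batch.vae_image_sizes[0]
    target_len = lt * lh * lw
    target_patches = latents[:, :target_len]
    # Hypothetical unpatchify with a leading b dimension; the exact pattern
    # depends on the model's patch layout.
    return rearrange(target_patches, "b (t h w) c -> b c t h w", t=lt, h=lh, w=lw)
```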
Great catch — fixed in e9400a9.
This update is not only in post_denoising_loop: it also aligns condition latents to batch.batch_size in postprocess_image_latent, and aligns encoder_hidden_states/mask batch in JoyTransformer3DModel with strict mismatch checks.
So batched requests (num_outputs_per_prompt > 1) are handled consistently end-to-end.
BBuf left a comment:

Please resolve the above comments.
Force-pushed from 9487adc to e9400a9.
@BBuf Thanks again for your comments — I’ve addressed both of them and replied in each thread. When you have a moment, could you please take another look and continue the review/approve if everything looks good? Really appreciate it 🙏
Could you add an example to the PR showing your output alongside diffusers output, so we can verify correctness? We also need a CI test case for this. Please refer to the existing CI tests and add one in the same style.
Thanks for the suggestion — I’ve added an example comparison between SGLang and diffusers outputs using the same input/settings as a correctness check.
I’m now adding a CI test case in the existing diffusion CI style and will push it in a follow-up commit shortly.
/tag-and-rerun-ci
Added a JoyAI image-edit 1-GPU diffusion CI case following the existing config-driven test style. It currently runs as a smoke generation case with perf/consistency checks disabled.
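For readers unfamiliar with that style, a smoke case boils down to something like the following (purely illustrative pytest sketch; the fixture and engine API are invented here, only the model repo name is real):

```python
# Purely illustrative smoke test; `diffusion_engine` is a hypothetical
# fixture standing in for SGLang's config-driven diffusion CI harness.
import pytest

MODEL = "jdopensource/JoyAI-Image-Edit-Diffusers"

def test_joyai_image_edit_smoke(diffusion_engine):
    engine = diffusion_engine(MODEL, num_gpus=1)
    out = engine.edit_image(
        image="tests/assets/example.png",  # hypothetical asset path
        prompt="make the sky a sunset orange",
        num_inference_steps=4,             # few steps: generation smoke only
    )
    assert out is not None  # perf/consistency checks intentionally disabled
```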
Update: I learned from the Diffusers PR author that the upstream weight names may still change before huggingface/diffusers#13444 is merged. Since this SGLang integration includes weight-loading mappings, I think it may be safer to align this PR with the final upstream weight format before merging. I’ll keep the CI case and implementation ready, but may update the loader mappings once the Diffusers PR/model repo format is finalized.
You can turn on the perf flag, run it once, and then paste the perf data into perf_baseline.json. |
I enabled the perf check, but no perf baseline was produced, because the failure happens before generation: importing the custom Joy transformer fails. Could you help rerun/check the multimodal-gen 1-gpu CI environment? After the case actually runs, I’ll paste the generated baseline into perf_baseline.json.
are we ready with this PR?
mickqian left a comment:
could you update the doc if you have time?
Yes, I believe we are ready now.
Sure. Do you mean adding JoyAI-Image-Edit to the diffusion documentation / supported model list, or is there a specific doc page you want updated?
Co-authored-by: chengyusong1 <chengyusong1@jd.com>



Motivation
We are the JoyAI Team. This PR adds SGLang multimodal generation support for JoyAI-Image-Edit to enable image-editing inference with the JoyAI architecture in the existing diffusion pipeline framework. JoyAI-Image is a unified multimodal foundation model for image understanding, text-to-image generation, and instruction-guided image editing. It combines an 8B Multimodal Large Language Model (MLLM) with a 16B Multimodal Diffusion Transformer (MMDiT).
Model: https://huggingface.co/jdopensource/JoyAI-Image-Edit-Diffusers
Modifications
- New `JoyImageEditPipelineConfig` + `JoyImageEditSamplingParams`.
- New `JoyImageArchConfig` / `JoyImageDiTConfig`.
- New `JoyTransformer3DModel` and Joy image-edit transformer path.
- `Qwen3VLConfig` text encoder wired into `JoyImageEditPipelineConfig`.
- `JoyImageEditPipeline` with standard TI2I stage composition.
- Joy text postprocess registered in `postprocess_text_funcs`.
- Default sampling: `guidance_scale=4.0`, `num_inference_steps=40`, `num_frames=1`, empty negative prompt.
- `enable_sequence_shard` for Joy pipelines.
- `ImagePipelineConfig.shard_latents_for_sp`: bypass additional latent sharding when sequence shard is already enabled.
- `WanVAEConfig`: `get_vae_scale_factor()` and `post_init()` fields for unified downstream scale-factor usage.

Accuracy Tests
N/A for this PR.
This change mainly integrates a new model/pipeline into the framework and does not modify existing model forward logic or kernel behavior.
Speed Tests and Profiling
N/A for this PR.
No speed regression/performance optimization is claimed in this change.
Checklist
Review and Merge Process
/tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci