
[Diffusion][NPU] Add support for Hunyuan3D#20352

Merged
sglang-npu-bot merged 4 commits into sgl-project:main from e-martirosian:hunyuan3d_npu_support
Mar 24, 2026
Conversation

@e-martirosian (Contributor) commented Mar 11, 2026

Motivation

This PR adds NPU support to the Hunyuan3D pipeline.

Modifications

  • Added support for building the custom rasterizer for CPU-only environments.
  • Rasterize on CPU when input tensors are on NPU.
  • Converted image_tensors to float32 (previously float64/double) in _run_delight, since double is not supported on NPU.
  • Fixed device propagation in MeshRender.
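
The fallback rule in the first two bullets can be sketched as a small helper (illustrative only; `select_rasterize_device` is not a function from this PR):

```python
def select_rasterize_device(input_device_type: str, cuda_available: bool) -> str:
    """Pick where the custom rasterizer kernel should run.

    There is no NPU build of the custom kernel, so inputs living on an
    NPU are rasterized on CPU and the results moved back afterwards.
    """
    if input_device_type == "npu":
        return "cpu"  # fall back: no custom kernel is built for NPU
    if input_device_type == "cuda" and cuda_available:
        return "cuda"  # the CUDA kernel was compiled into the extension
    return "cpu"  # the CPU-only build handles everything else
```

For example, inputs on "npu" rasterize on "cpu", matching the rasterize change in this PR.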

Accuracy Tests

GPU

Before:

gpu_before

After:

gpu_after

NPU

Before: the pipeline failed.

After:

npu_after

Benchmarking and Profiling

GPU

The performance difference is within the error margin.

Devices: 1 x Nvidia A10
Command: sglang generate --model-path tencent/Hunyuan3D-2 --image-path ./assets/demo.png

Before: 546.98 seconds
After: 545.57 seconds

NPU

Devices: one chip of Ascend A3
Command: sglang generate --model-path tencent/Hunyuan3D-2 --image-path ./assets/demo.png

Generated in 518.22 seconds

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@github-actions (Bot) added the diffusion (SGLang Diffusion) label Mar 11, 2026
@gemini-code-assist (Contributor)

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces NPU (Neural Processing Unit) support for the Hunyuan3D pipeline by enhancing the custom rasterizer to handle different device types (CPU/CUDA) and ensuring proper tensor device placement. It also refines image tensor handling and renderer initialization for better compatibility and robustness.

Highlights

  • Conditional CUDA Kernel Loading: The custom rasterizer's CUDA kernel (rasterizer_gpu.cu) is now conditionally included in the build process only when torch.cuda.is_available() is true, improving build flexibility and reducing dependencies for CPU-only environments.
  • Enhanced Device Handling in Rasterizer: The rasterize function now explicitly manages tensor device placement. Input tensors are moved to the appropriate processing device (CPU or CUDA) and, if processed on a non-CUDA device, the results are moved back to the original device, ensuring correct operation across different hardware configurations.
  • Conditional Compilation for C++ Rasterizer: Conditional compilation directives (#ifdef __CUDACC__) have been added to the C++ rasterizer implementation (rasterizer.cpp) to correctly dispatch between CPU and GPU rasterization functions based on whether the code is being compiled with a CUDA compiler.
  • Macro Definitions for Non-CUDA Builds: The __host__ and __device__ macros are now conditionally defined in rasterizer.h when __CUDACC__ is not present, preventing compilation errors in environments without CUDA support.
  • Explicit Float Casting for Image Tensors: Image tensors in the hunyuan3d_paint stage are now explicitly cast to float type during conversion from NumPy arrays, ensuring consistent data types for subsequent operations.
  • Device Assignment for Mesh Renderer: The MeshRender constructor now accepts a device argument, allowing for explicit device assignment during renderer initialization, which is crucial for multi-device environments.
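
The first highlight (conditional CUDA kernel loading) boils down to choosing the extension's source list at build time. A minimal sketch, using the file names from the changelog; the helper `build_sources` itself is hypothetical:

```python
def build_sources(cuda_available: bool) -> list[str]:
    # The C++/CPU implementation is always compiled; the CUDA kernel is
    # appended only when a CUDA runtime is present, mirroring the
    # torch.cuda.is_available() check described above.
    sources = ["rasterizer.cpp"]
    if cuda_available:
        sources.append("rasterizer_gpu.cu")
    return sources
```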


Changelog
  • python/sglang/multimodal_gen/csrc/render/hunyuan3d_rasterizer/__init__.py
    • Updated _load_custom_rasterizer to conditionally include rasterizer_gpu.cu in sources based on torch.cuda.is_available().
    • Modified rasterize function to determine the processing device, move input tensors (pos, tri, clamp_depth) to that device, and move output tensors (findices, barycentric) back to the original device if not CUDA.
  • python/sglang/multimodal_gen/csrc/render/hunyuan3d_rasterizer/rasterizer.cpp
    • Wrapped the device-checking logic for calling rasterize_image_cpu or rasterize_image_gpu within an #ifdef __CUDACC__ block, falling back to rasterize_image_cpu if __CUDACC__ is not defined.
  • python/sglang/multimodal_gen/csrc/render/hunyuan3d_rasterizer/rasterizer.h
    • Added conditional definitions for __host__ and __device__ macros when __CUDACC__ is not defined, to ensure compatibility during compilation.
  • python/sglang/multimodal_gen/runtime/pipelines_core/stages/hunyuan3d_paint.py
    • Modified _run_delight to explicitly cast image_tensor to float type when converting NumPy arrays to PyTorch tensors.
    • Added device=self.device argument to the MeshRender constructor in _init_renderer.
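
The explicit float cast in _run_delight matters because NumPy promotes Python floats to float64 ("double"), which the NPU backend does not support. A NumPy-only illustration of the dtype issue (the actual code converts to a PyTorch tensor):

```python
import numpy as np

# An array built from Python floats defaults to float64 ("double"),
# which is not supported on NPU.
image = np.asarray([[0.25, 0.5], [0.75, 1.0]])
assert image.dtype == np.float64

# Casting to float32 before handing the data to the pipeline avoids
# the unsupported-dtype failure.
image32 = image.astype(np.float32)
assert image32.dtype == np.float32
```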
Activity
  • No specific activity (comments, reviews, or CI status) has been recorded for this pull request yet.

@gemini-code-assist (Bot) left a comment

Code Review

This pull request successfully adds support for non-CUDA devices (CPU) to the Hunyuan3D pipeline, utilizing conditional compilation and proper device placement for tensors. A comprehensive security audit found no significant security vulnerabilities, confirming that the changes primarily focus on device compatibility and type consistency without introducing new security risks. The implementation is well-executed, and no further improvements are suggested.

Comment thread python/sglang/multimodal_gen/csrc/render/hunyuan3d_rasterizer/__init__.py Outdated
@e-martirosian force-pushed the hunyuan3d_npu_support branch from 41d0a95 to a057758 on March 16, 2026 07:36
@e-martirosian force-pushed the hunyuan3d_npu_support branch from 8ac04c0 to 6507021 on March 16, 2026 09:10
Comment thread python/sglang/multimodal_gen/configs/pipeline_configs/base.py Outdated
@ssshinigami (Contributor)

Please add more description and generated results to show that it works, along with the generation latency.

@ping1jing2 changed the title from "[NPU] Add support for Hunyuan3D" to "[Diffusion][NPU] Add support for Hunyuan3D" on Mar 18, 2026
@ssshinigami left a comment

LGTM

@e-martirosian marked this pull request as ready for review March 19, 2026 14:03
"""Rasterize mesh to get face indices and barycentric coordinates."""
kernel = _load_custom_rasterizer()
device = "cpu" if pos.device.type == "npu" else pos.device.type
kernel = _load_custom_rasterizer(device == "cuda")
A Collaborator commented:

will it also work for other hardware backends such as AMD?

@e-martirosian (Contributor, Author) replied:

We only check for NPU and, in that case, run the custom kernel on CPU. This is intentional: we don't expect the fallback to work on other backends. Developers supporting other hardware can either implement this custom kernel for their backend or explicitly run this part on CPU; keeping the check narrow ensures they actually notice this code path as an optimization target rather than silently inheriting a slow fallback.

@ping1jing2 (Collaborator): /tag-and-rerun-ci

@ping1jing2 (Collaborator): /rerun-failed-ci

1 similar comment from @ping1jing2.

@sglang-npu-bot sglang-npu-bot merged commit 9f4d8ac into sgl-project:main Mar 24, 2026
108 of 122 checks passed
adityavaid pushed a commit to adityavaid/sglang that referenced this pull request Mar 24, 2026
Co-authored-by: Elizaveta Martirosian <elizaveta.martirosian@gmail.com>
adityavaid pushed a commit to adityavaid/sglang that referenced this pull request Mar 24, 2026
Co-authored-by: Elizaveta Martirosian <elizaveta.martirosian@gmail.com>
0-693 pushed a commit to 0-693/sglang that referenced this pull request Mar 25, 2026
Co-authored-by: Elizaveta Martirosian <elizaveta.martirosian@gmail.com>
johnnycxm pushed a commit to johnnycxm/sglang that referenced this pull request Mar 25, 2026
Co-authored-by: Elizaveta Martirosian <elizaveta.martirosian@gmail.com>
johnnycxm pushed a commit to johnnycxm/sglang that referenced this pull request Mar 25, 2026
Co-authored-by: Elizaveta Martirosian <elizaveta.martirosian@gmail.com>
JustinTong0323 pushed a commit to JustinTong0323/sglang that referenced this pull request Apr 7, 2026
Co-authored-by: Elizaveta Martirosian <elizaveta.martirosian@gmail.com>
yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request Apr 22, 2026
Co-authored-by: Elizaveta Martirosian <elizaveta.martirosian@gmail.com>

Labels

diffusion (SGLang Diffusion), run-ci


6 participants