
Enable renormalize(naive) routing for fp8 per-tensor #2030

Merged

IwakuraRein merged 2 commits into main from fp8-renormalize-routing on Nov 11, 2025

Conversation

IwakuraRein (Collaborator) commented Nov 3, 2025

📌 Description

Disable expert weights in FC1 except for Llama4 routing.

🔍 Related Issues

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

  • Bug Fixes

    • Made token_scales available for Llama4 routing.
    • Corrected GEMM1 input so the proper data is used during MoE processing.
  • Tests

    • Added FP8PerTensorMoe to test parameterization.
    • Expanded coverage for Renormalize, DeepSeekV3, Qwen3 and Llama4 routing configurations.

coderabbitai bot (Contributor) commented Nov 3, 2025

Walkthrough

Expose a new token_scales pointer in the MoE workspace, set it from expert_weights.data_ptr() in launcher paths when routing_method_type is Llama4, consume token_scales in the PermuteGemm1 call instead of expert_weights, and add FP8PerTensorMoe to MoE test parameterizations.

Changes

  • csrc/trtllm_fused_moe_kernel_launcher.cu (kernel launcher): in Fp8PerTensorLauncher::prepare_routing, when routing_method_type == Llama4, assign workspace.token_scales = expert_weights.data_ptr() so token_scales can be consumed by permuteGemm1; no other flow or error-handling changes.
  • csrc/trtllm_fused_moe_runner.cu (runner): in Runner::run, pass workspace.token_scales (instead of workspace.expert_weights) as the GEMM1 input to mPermuteGemm1.run and adjust the subsequent argument ordering accordingly.
  • include/flashinfer/trtllm/fused_moe/runner.h (workspace struct): add void* token_scales = nullptr; to MoEWorkspace, with a comment indicating it is consumed by the permuteGemm1 kernel; expert_weights remains used by finalize.
  • tests/moe/test_trtllm_gen_fused_moe.py (test coverage): add FP8PerTensorMoe (id FP8_Tensor) to the parameterized MoE tests and include it among compatible implementations for the Qwen3, Renormalize, Qwen3_next, DeepSeekV3, and Llama4 routing scenarios.

Sequence Diagram

```mermaid
sequenceDiagram
    autonumber
    participant Launcher as Kernel Launcher
    participant Workspace as MoE Workspace
    participant Runner as PermuteGemm1
    participant Finalize as Finalize Kernel

    Note over Launcher: check routing_method_type
    alt routing_method_type == Llama4
        Launcher->>Workspace: workspace.token_scales = expert_weights.data_ptr()
    else other routing
        Launcher->>Workspace: token_scales left nullptr/unmodified
    end

    Launcher->>Workspace: prepare workspace
    Workspace->>Runner: mPermuteGemm1.run(input = token_scales, ...)
    Runner->>Runner: permute / GEMM using token_scales
    Workspace->>Finalize: provide expert_weights for finalize
    Finalize->>Finalize: finalize outputs
```

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Focus areas:
    • Verify the Llama4 conditional is applied in all relevant launcher paths.
    • Confirm token_scales lifetime and memory layout from expert_weights.data_ptr() matches permuteGemm1 expectations.
    • Check mPermuteGemm1.run argument reordering for correct parameter mapping.
    • Review added test parametrizations for correctness and coverage.
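The added test parametrization might look roughly like the following sketch. The class names and ids mirror the summary above; the actual structure of tests/moe/test_trtllm_gen_fused_moe.py may differ.

```python
import pytest

# Illustrative stand-ins for the MoE implementation wrappers in the test file.
class FP8BlockScaleMoe:
    pass

class FP8PerTensorMoe:
    pass

MOE_IMPLS = [
    pytest.param(FP8BlockScaleMoe(), id="FP8_Block"),
    pytest.param(FP8PerTensorMoe(), id="FP8_Tensor"),  # added by this PR
]

ROUTING_METHODS = ["Renormalize", "DeepSeekV3", "Qwen3", "Llama4"]

@pytest.mark.parametrize("moe_impl", MOE_IMPLS)
@pytest.mark.parametrize("routing", ROUTING_METHODS)
def test_moe_routing(moe_impl, routing):
    # The real test would run the fused MoE kernel and compare against
    # a reference implementation; this sketch only shows the fan-out.
    assert routing in ROUTING_METHODS
```

Cross-parameterizing the implementation against every routing method is what produces the expanded Renormalize/DeepSeekV3/Qwen3/Llama4 coverage noted in the summary.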

Suggested reviewers

  • cyx-6
  • wenscarl
  • djmmoss
  • yzh119
  • joker-eph
  • aleozlx
  • jiahanc

Poem

🐇 I found a pointer, light and neat,
I set token_scales to guide the feat.
GEMM1 takes a hop, the kernels sing,
Tests applaud the tiny spring.
🥕

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings, 1 inconclusive)
  • Title check (⚠️ Warning): The PR title 'Enable renormalize(naive) routing for fp8 per-tensor' does not match the description, which states 'Disable expert weights in the FC1 except for Llama routing.' Resolution: align the title and description — either update the title to reflect disabling expert weights in FC1, or update the description to clarify the renormalize routing enablement.
  • Docstring coverage (⚠️ Warning): Docstring coverage is 20.00%, below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.
  • Description check (❓ Inconclusive): The description is vague and generic; it lacks specifics about what changed, why, and how it relates to the code modifications. Resolution: explain the purpose of the changes, which files were modified, and how they enable FP8 per-tensor routing for the specified routing methods.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b2b7bfc and b2d228c.

📒 Files selected for processing (1)
  • csrc/trtllm_fused_moe_kernel_launcher.cu (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • csrc/trtllm_fused_moe_kernel_launcher.cu
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Deploy Docs


@IwakuraRein IwakuraRein marked this pull request as ready for review November 10, 2025 17:29
@IwakuraRein IwakuraRein force-pushed the fp8-renormalize-routing branch 2 times, most recently from 5a48ddf to 17fb87a on November 10, 2025 17:55
jiahanc (Collaborator) left a comment:

LGTM, thanks for the contribution!

@IwakuraRein IwakuraRein force-pushed the fp8-renormalize-routing branch from 17fb87a to d42fb90 on November 10, 2025 18:04
jiahanc (Collaborator) commented Nov 10, 2025

/bot run

flashinfer-bot (Collaborator) commented:

GitLab MR !124 has been created, and the CI pipeline #38224308 is currently running. I'll report back once the pipeline job completes.

Signed-off-by: siyuanf <siyuanf@nvidia.com>
@IwakuraRein IwakuraRein reopened this Nov 10, 2025
jiahanc (Collaborator) commented Nov 10, 2025

/bot run

@IwakuraRein IwakuraRein enabled auto-merge (squash) November 10, 2025 18:09
flashinfer-bot (Collaborator) commented:

GitLab MR !125 has been created, and the CI pipeline #38224491 is currently running. I'll report back once the pipeline job completes.

Signed-off-by: Siyuan Fu <siyuanf@nvidia.com>
jiahanc (Collaborator) commented Nov 10, 2025

/bot run

flashinfer-bot (Collaborator) commented:

GitLab MR !125 has been updated with latest changes, and the CI pipeline #38236290 is currently running. I'll report back once the pipeline job completes.

djmmoss (Collaborator) left a comment:

LGTM

@IwakuraRein IwakuraRein merged commit fbdb439 into main Nov 11, 2025
4 checks passed
@IwakuraRein IwakuraRein deleted the fp8-renormalize-routing branch November 11, 2025 02:19
coderabbitai bot mentioned this pull request Feb 23, 2026
BingooYang pushed a commit to BingooYang/flashinfer that referenced this pull request Mar 13, 2026